Abstract

Let $Y \in \mathbb{R}^n$ be a random vector with mean $s$ and covariance matrix $\sigma^2 P_n\,{}^tP_n$ where
$P_n$ is some known $n \times n$ matrix. We construct a statistical procedure to estimate $s$ under
either a moment condition on $Y$ or a Gaussian hypothesis. Both cases are developed
for known or unknown $\sigma^2$. Our approach is free from any prior assumption on $s$ and is
based on non-asymptotic model selection methods. Given some collection of linear spaces
$\{S_m, m \in \mathcal{M}\}$, we consider, for any $m \in \mathcal{M}$, the least-squares estimator $\hat{s}_m$ of $s$ in $S_m$.
Considering a penalty function that is not linear in the dimensions of the $S_m$'s, we select
some $\hat{m} \in \mathcal{M}$ in order to get an estimator $\hat{s}_{\hat{m}}$ with a quadratic risk as close as possible
to the minimal one among the risks of the $\hat{s}_m$'s. Non-asymptotic oracle-type inequalities
and minimax convergence rates are proved for $\hat{s}_{\hat{m}}$. Special attention is given to the
estimation of a non-parametric component in additive models. Finally, we carry out a
simulation study in order to illustrate the performance of our estimators in practice.

1 Introduction 
1.1 Additive models 

The general form of a regression model can be expressed as 

$$ Z = f(X) + \sigma\varepsilon \tag{1} $$

where $X = (X^{(1)}, \dots, X^{(k)})'$ is the $k$-dimensional vector of explanatory variables that belongs
to some product space $\mathcal{X} = \mathcal{X}_1 \times \dots \times \mathcal{X}_k \subset \mathbb{R}^k$, the unknown function $f : \mathcal{X} \to \mathbb{R}$ is called
regression function, the positive real number $\sigma$ is a standard deviation factor and the real
random noise $\varepsilon$ is such that $E[\varepsilon | X] = 0$ and $E[\varepsilon^2 | X] < \infty$ almost surely.

In such a model, we are interested in the behavior of $Z$ in accordance with the fluctuations
of $X$. In other words, we want to explain the random variable $Z$ through the function
$f(x) = E[Z | X = x]$. For this purpose, many approaches have been proposed and, among them, a
widely used one is linear regression

$$ Z = \mu + \sum_{i=1}^{k} \beta_i X^{(i)} + \sigma\varepsilon \tag{2} $$

where $\mu$ and the $\beta_i$'s are unknown constants. This model benefits from easy interpretation
in practice and, from a statistical point of view, allows componentwise analysis. However, a
drawback of linear regression is its lack of flexibility for modeling more complex dependencies
between $Z$ and the $X^{(i)}$'s. In order to bypass this problem while keeping the advantages of
models like (2), we can generalize them by considering additive regression models of the form

$$ Z = \mu + \sum_{i=1}^{k} f_i(X^{(i)}) + \sigma\varepsilon \tag{3} $$

where the unknown functions $f_i : \mathcal{X}_i \to \mathbb{R}$ will be referred to as the components of the
regression function $f$. The object of this paper is to construct a data-driven procedure for
estimating one of these components on a fixed design (i.e. conditionally to some realizations
of the random variable $X$). Our approach is based on nonasymptotic model selection and is
free from any prior assumption on $f$ and its components. In particular, we do not make any
regularity hypothesis on the function to estimate except to deduce uniform convergence rates
for our estimators.

Models (3) are not new and were first considered in the context of input-output analysis by
Leontief [23] and in analysis of variance by Scheffé [35]. This kind of model structure is widely
used in theoretical economics and in econometric data analysis and leads to many well-known
economic results. For more details about the interpretability of additive models in economics, the
interested reader can find many references at the end of Chapter 8 of [18].

As mentioned above, regression models are useful for interpreting the effects of $X$ on
changes of $Z$. To this end, statisticians have to estimate the regression function $f$. Assuming
that we observe a sample $\{(X_1, Z_1), \dots, (X_n, Z_n)\}$ obtained from model (1), it is well
known (see [37]) that the optimal $L^2$ convergence rate for estimating $f$ is of order $n^{-\alpha/(2\alpha+k)}$
where $\alpha > 0$ is an index of smoothness of $f$. Note that, for large values of $k$, this rate becomes
slow and the performance of any estimation procedure suffers from what is called the curse of
dimensionality in the literature. In this connection, Stone [37] has proved the notable fact that,
for additive models (3), the optimal $L^2$ convergence rate for estimating each component $f_i$
of $f$ is the one-dimensional rate $n^{-\alpha/(2\alpha+1)}$. In other terms, estimation of the component
$f_i$ in (3) can be done with the same optimal rate as the one achievable with the model
$Z = f_i(X^{(i)}) + \sigma\varepsilon$.
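To make the gap between the $k$-dimensional rate and the one-dimensional rate concrete, the following minimal sketch simply evaluates $n^{-\alpha/(2\alpha+k)}$ for a few dimensions. The function name `rate` and the values of `n` and `alpha` are illustrative assumptions, not quantities taken from the paper.

```python
def rate(n: int, alpha: float, k: int) -> float:
    """Order n^(-alpha / (2*alpha + k)) of the optimal L2 rate in dimension k."""
    return n ** (-alpha / (2 * alpha + k))


n, alpha = 10_000, 2.0  # illustrative sample size and smoothness index
for k in (1, 2, 5, 10):
    # k = 1 is the rate available for one component of an additive model
    print(f"k={k:2d}  rate={rate(n, alpha, k):.4f}")
```

The rate degrades quickly as `k` grows, which is the curse of dimensionality that the additive structure avoids.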

Component estimation in additive models has received considerable interest since the eighties
and this theory benefited a lot from the works of Buja et al. [15] and Hastie and Tibshirani
[19]. Very popular methods for estimating components in (3) are based on backfitting procedures
(see [12] for more details). These techniques are iterative and may depend on the
starting values. The performance of these methods depends heavily on the choice of some
convergence criterion and the nature of the obtained results is usually asymptotic (see, for
example, the works of Opsomer and Ruppert [30] and Mammen, Linton and Nielsen [26]).
More recent non-iterative methods have been proposed for estimating marginal effects of the
$X^{(i)}$ on the variable $Z$ (i.e. how $Z$ fluctuates on average if one explanatory variable is varying
while others stay fixed). These procedures, known as marginal integration estimation, were
introduced by Tjøstheim and Auestad [38] and Linton and Nielsen [24]. In order to estimate
the marginal effect of $X^{(i)}$, these methods proceed in two steps. First, they estimate the
regression function $f$ by a particular estimator $f^*$, called pre-smoother, and then they average
$f^*$ according to all the variables except $X^{(i)}$. The way of constructing $f^*$ is fundamental
and, in practice, one uses a special kernel estimator (see [34] and [36] for a discussion on this
subject). To this end, one needs to estimate two unknown bandwidths that are necessary for
getting $f^*$. Dealing with a finite sample, the impact of how we estimate these bandwidths is
not clear and, as for backfitting, the theoretical results obtained by these methods are mainly
asymptotic.

In contrast with these methods, we are interested here in nonasymptotic procedures to
estimate components in additive models. The following subsection introduces some notation
and the framework that we handle, together with a short review of existing results
on nonasymptotic estimation in additive models.

1.2 Statistical framework 

We are interested in estimating one of the components in the model (3) with, for any $i$,
$\mathcal{X}_i = [0, 1]$. To focus on it, we denote by $s : [0, 1] \to \mathbb{R}$ the component that we plan to estimate
and by $t^1, \dots, t^K : [0, 1] \to \mathbb{R}$ the $K \geq 1$ other ones. Thus, considering the design points
$(x_1, y_1^1, \dots, y_1^K)', \dots, (x_n, y_n^1, \dots, y_n^K)' \in [0, 1]^{K+1}$, we observe

$$ Z_i = s(x_i) + \mu + \sum_{j=1}^{K} t^j(y_i^j) + \sigma\varepsilon_i, \quad i = 1, \dots, n , \tag{4} $$

where the components $s, t^1, \dots, t^K$ are unknown functions, $\mu$ is an unknown real number, $\sigma$
is a positive factor and $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)'$ is an unobservable centered random vector with i.i.d.
components of unit variance.

Let $\nu$ be a probability measure on $[0, 1]$, we introduce the space of centered and
square-integrable functions

$$ L_0^2([0,1], \nu) = \left\{ f \in L^2([0,1], \nu) : \int_0^1 f(t)\,\nu(dt) = 0 \right\} . $$

Let $\nu_1, \dots, \nu_K$ be $K$ probability measures on $[0, 1]$, to avoid identification problems in the
sequel, we assume

$$ s \in L_0^2([0,1], \nu) \quad \text{and} \quad t^j \in L_0^2([0,1], \nu_j), \quad j = 1, \dots, K . \tag{5} $$

This hypothesis is not restrictive since we are interested in how $Z = (Z_1, \dots, Z_n)'$ fluctuates
with respect to the $x_i$'s. A shift on the components does not affect these fluctuations and the
estimation proceeds up to the additive constant $\mu$.

The results described in this paper are obtained under two different assumptions on the
noise terms $\varepsilon_i$, namely

$(\mathrm{H}_{\mathrm{Gau}})$ the random vector $\varepsilon$ is a standard Gaussian vector in $\mathbb{R}^n$,

and

$(\mathrm{H}_{\mathrm{Mom}})$ the variables $\varepsilon_i$ satisfy the moment condition

$$ \exists p > 2 \text{ such that } \forall i, \ \tau_p = E\left[|\varepsilon_i|^p\right] < \infty . \tag{6} $$

Obviously, $(\mathrm{H}_{\mathrm{Mom}})$ is weaker than $(\mathrm{H}_{\mathrm{Gau}})$. We consider these two cases in order to illustrate
how much better the results are in the Gaussian case than in the moment condition case.
From the point of view of model selection, we show in the corollaries of Section 2 that we are
allowed to work with more general model collections under $(\mathrm{H}_{\mathrm{Gau}})$ than under $(\mathrm{H}_{\mathrm{Mom}})$ in
order to get similar results. Thus, the main contribution of the Gaussian assumption is to
give more flexibility to the procedure described in the sequel.

So, our aim is to estimate the component $s$ on the basis of the observations (4). For the sake
of simplicity of this introduction, we assume that the quantity $\sigma^2 > 0$ is known (see Section
3 for unknown variance) and we introduce the vectors $s = (s_1, \dots, s_n)'$ and $t = (t_1, \dots, t_n)'$
defined by, for any $i \in \{1, \dots, n\}$,

$$ s_i = s(x_i) \quad \text{and} \quad t_i = \mu + \sum_{j=1}^{K} t^j(y_i^j) . \tag{7} $$

Moreover, we assume that we know two linear subspaces $E, F \subset \mathbb{R}^n$ such that $s \in E$, $t \in F$
and $E \oplus F = \mathbb{R}^n$. Of course, such spaces are not available to the statisticians in practice and,
when we handle additive models in Section 4, we will not suppose that they are known. Let
$P_n$ be the projection onto $E$ along $F$, we derive from (4) the following regression framework

$$ Y = P_n Z = s + \sigma P_n \varepsilon \tag{8} $$

where $Y = (Y_1, \dots, Y_n)'$ belongs to $E = \mathrm{Im}(P_n) \subset \mathbb{R}^n$.
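The effect of projecting onto $E$ along $F$ can be seen on a toy example. The sketch below works in $\mathbb{R}^2$ with one-dimensional spaces $E$ and $F$ that are not orthogonal; the helper name `oblique_projection` and the vectors `e`, `f`, `s`, `t` are hypothetical choices made only for illustration. The nuisance part lying in $F$ is removed exactly and the part lying in $E$ is kept.

```python
def oblique_projection(e, f, x):
    """Project x in R^2 onto span(e) along span(f).

    Writes x = a*e + b*f (possible when e and f are linearly independent)
    and returns a*e, i.e. the component of x in E = span(e).
    """
    det = e[0] * f[1] - e[1] * f[0]
    if det == 0:
        raise ValueError("e and f must be linearly independent")
    a = (x[0] * f[1] - x[1] * f[0]) / det  # Cramer's rule for the coefficient on e
    return (a * e[0], a * e[1])


e, f = (1.0, 0.0), (1.0, 1.0)    # E = span(e), F = span(f), not orthogonal
s = (2.0, 0.0)                   # component of interest, lies in E
t = (3.0, 3.0)                   # nuisance component, lies in F
z = (s[0] + t[0], s[1] + t[1])   # noiseless observation Z = s + t

y = oblique_projection(e, f, z)  # Y = P Z recovers s because P t = 0
print(y)
```

With noise added to `z`, the same map sends the noise $\sigma\varepsilon$ to $\sigma P_n\varepsilon$, which is why the observations in the projected framework are correlated and have unequal variances.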

The framework (8) is similar to the classical signal-plus-noise regression framework but the
data are not independent and their variances are not equal. Because the variances of the
observations are unequal, we qualify (8) as a heteroscedastic framework. The
object of this paper is to estimate the component $s$ and we handle (8) to this end. The
particular case of $P_n$ equal to the unit matrix has been widely treated in the literature (see,
for example, [10] for $(\mathrm{H}_{\mathrm{Gau}})$ and [4] for $(\mathrm{H}_{\mathrm{Mom}})$). The case of an unknown but diagonal
matrix $P_n$ has been studied in several papers for the Gaussian case (see, for example, [16] and
[17]). By using cross-validation and resampling penalties, Arlot and Massart [3] and Arlot [2]
have also considered the framework (8) with unknown diagonal matrix $P_n$. Laurent, Loubes
and Marteau [21] deal with a known diagonal matrix $P_n$ for studying testing procedures in an
inverse problem framework. The general case of a known non-diagonal matrix $P_n$ naturally
appears in applied fields such as, for example, genomic studies (see Chapters 4 and 5 of [33]).

The results that we introduce in the sequel consider the framework (8) from a general
outlook and we do not make any prior hypothesis on $P_n$. In particular, we do not suppose
that $P_n$ is invertible. We only assume that it is a projector when we handle the problem
of component estimation in an additive framework in Section 4. Without loss of generality,
we always admit that $s \in \mathrm{Im}(P_n)$. Indeed, if $s$ does not belong to $\mathrm{Im}(P_n)$, it suffices to
consider the orthogonal projection $\pi_{P_n}$ onto $\mathrm{Im}(P_n)^\perp$ and to notice that $\pi_{P_n}Y = \pi_{P_n}s$ is not
random. Thus, replacing $Y$ by $Y - \pi_{P_n}Y$ leads to (8) with a mean lying in $\mathrm{Im}(P_n)$. For
general matrix $P_n$, other approaches could be used. However, for the sake of legibility, we
consider $s \in \mathrm{Im}(P_n)$ because, for the estimation of a component in an additive framework, by
construction, we always have $Y = P_nZ \in \mathrm{Im}(P_n)$ as it will be specified in Section 4.

We now describe our estimation procedure in detail. For any $z \in \mathbb{R}^n$, we define the
least-squares contrast by

$$ \gamma_n(z) = \|Y - z\|_n^2 = \frac{1}{n}\sum_{i=1}^{n} (Y_i - z_i)^2 . $$

Let us consider a collection of linear subspaces of $\mathrm{Im}(P_n)$ denoted by $\mathcal{F} = \{S_m, m \in \mathcal{M}\}$
where $\mathcal{M}$ is a finite or countable index set. Hereafter, the $S_m$'s will be called the models.
Denoting by $\pi_m$ the orthogonal projection onto $S_m$, the minimum of $\gamma_n$ over $S_m$ is achieved
at a single point $\hat{s}_m = \pi_m Y$ called the least-squares estimator of $s$ in $S_m$. Note that the
expectation of $\hat{s}_m$ is equal to the orthogonal projection $s_m = \pi_m s$ of $s$ onto $S_m$. We have the
following identity for the quadratic risks of the $\hat{s}_m$'s.

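Proposition 1.1. Let $m \in \mathcal{M}$, the least-squares estimator $\hat{s}_m = \pi_m Y$ of $s$ in $S_m$ satisfies

$$ E\left[\|s - \hat{s}_m\|_n^2\right] = \|s - s_m\|_n^2 + \frac{\mathrm{Tr}({}^tP_n \pi_m P_n)}{n}\sigma^2 \tag{9} $$

where $\mathrm{Tr}(\cdot)$ is the trace operator.
Proof. By orthogonality we have

$$ \|s - \hat{s}_m\|_n^2 = \|s - s_m\|_n^2 + \sigma^2\|\pi_m P_n \varepsilon\|_n^2 . \tag{10} $$

Because the components of $\varepsilon$ are independent and centered with unit variance, we easily
compute

$$ E\left[\|\pi_m P_n \varepsilon\|_n^2\right] = \frac{\mathrm{Tr}({}^tP_n \pi_m P_n)}{n} . $$

We conclude by taking the expectation on both sides of (10). □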

A "good" estimator is one whose quadratic risk is small. The decomposition given by (9)
shows that this risk is a sum of two non-negative terms that can be interpreted as follows.
The first one, called bias term, corresponds to the capacity of the model $S_m$ to approximate
the true value of $s$. The second, called variance term, is proportional to $\mathrm{Tr}({}^tP_n \pi_m P_n)$ and
measures, in a certain sense, the complexity of $S_m$. If $S_m = \mathbb{R}u$, for some $u \in \mathbb{R}^n$, then the
variance term is small but the bias term is as large as $s$ is far from the too simple model $S_m$.
Conversely, if $S_m$ is a "huge" model, the whole $\mathbb{R}^n$ for instance, the bias is null but the price is a
great variance term. Thus, (9) illustrates why choosing a "good" model amounts to finding a
trade-off between bias and variance terms.
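The trade-off can be checked numerically on a toy case. The following is a minimal sketch, assuming nested models $S_d$ spanned by the first $d$ canonical basis vectors of $\mathbb{R}^n$, so that the orthogonal projection keeps the first $d$ coordinates and $\mathrm{Tr}({}^tP_n \pi_d P_n)$ is the sum of the squared entries of the first $d$ rows of $P_n$. The vector `s`, the matrix `P` and the function name `risk` are illustrative assumptions.

```python
def risk(s, P, d, sigma2):
    """Quadratic risk of the least-squares estimator on S_d = span(e_1, ..., e_d).

    Bias term: ||s - pi_d s||_n^2, the normalized energy of the dropped coordinates.
    Variance term: Tr(tP pi_d P) * sigma2 / n, where Tr(tP pi_d P) is the sum of
    the squared entries of the first d rows of P.
    """
    n = len(s)
    bias = sum(v * v for v in s[d:]) / n
    trace = sum(p * p for row in P[:d] for p in row)
    return bias + trace * sigma2 / n


s = [4.0, 2.0, 0.5, 0.1]       # illustrative mean vector
P = [[1.0, 0.0, 0.0, 0.0],     # illustrative known matrix (invertible here)
     [0.5, 1.0, 0.0, 0.0],
     [0.0, 0.5, 1.0, 0.0],
     [0.0, 0.0, 0.5, 1.0]]

risks = [risk(s, P, d, sigma2=1.0) for d in range(len(s) + 1)]
best = min(range(len(risks)), key=risks.__getitem__)
print(risks, best)
```

Small models pay through the bias term and large ones through the variance term; the minimizer sits in between. It depends on the unknown `s`, so it is an oracle choice rather than a usable procedure.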

Clearly, the choice of a model that minimizes the risk (9) depends on the unknown vector $s$
and makes good models unavailable to the statisticians. So, we need a data-driven procedure
to select an index $\hat{m} \in \mathcal{M}$ such that $E[\|s - \hat{s}_{\hat{m}}\|_n^2]$ is close to the smallest $L^2$-risk among the
collection of estimators $\{\hat{s}_m, m \in \mathcal{M}\}$, namely

$$ \mathcal{R}(s, \mathcal{F}) = \inf_{m \in \mathcal{M}} E\left[\|s - \hat{s}_m\|_n^2\right] . $$

To choose such an $\hat{m}$, a classical way in model selection consists in minimizing an empirical
penalized criterion stochastically close to the risk. Given a penalty function $\mathrm{pen} : \mathcal{M} \to \mathbb{R}_+$,
we define $\hat{m}$ as any minimizer over $\mathcal{M}$ of the penalized least-squares criterion

$$ \hat{m} \in \operatorname*{argmin}_{m \in \mathcal{M}} \left\{ \gamma_n(\hat{s}_m) + \mathrm{pen}(m) \right\} . \tag{11} $$

This way, we select a model $S_{\hat{m}}$ and we have at our disposal the penalized least-squares
estimator $\tilde{s} = \hat{s}_{\hat{m}}$. Note that, by definition, the estimator $\tilde{s}$ satisfies

$$ \forall m \in \mathcal{M}, \quad \gamma_n(\tilde{s}) + \mathrm{pen}(\hat{m}) \leq \gamma_n(\hat{s}_m) + \mathrm{pen}(m) . \tag{12} $$
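The selection rule (11) can be sketched in a few lines. As an assumption made only for this illustration, the models are the nested spaces spanned by the first $d$ canonical basis vectors, and the penalty is taken of the form $(1 + \theta + L)\,\mathrm{Tr}({}^tP_n \pi_d P_n)\sigma^2/n$ with hypothetical default values of `theta` and `L`. The data `Y`, the matrix `P` and the function name `select_model` are also illustrative.

```python
def select_model(Y, P, sigma2, theta=1.0, L=0.0):
    """Minimize gamma_n(hat s_d) + pen(d) over nested coordinate models S_d."""
    n = len(Y)

    def criterion(d):
        # gamma_n(hat s_d) = ||Y - pi_d Y||_n^2: normalized energy of dropped coordinates
        contrast = sum(y * y for y in Y[d:]) / n
        # Tr(tP pi_d P): sum of squared entries of the first d rows of P
        trace = sum(p * p for row in P[:d] for p in row)
        return contrast + (1 + theta + L) * trace * sigma2 / n

    d_hat = min(range(n + 1), key=criterion)
    s_tilde = Y[:d_hat] + [0.0] * (n - d_hat)  # hat s_{d_hat} = pi_{d_hat} Y
    return d_hat, s_tilde


Y = [4.3, 1.6, 0.9, -0.4]      # illustrative observations
P = [[1.0, 0.0, 0.0, 0.0],
     [0.5, 1.0, 0.0, 0.0],
     [0.0, 0.5, 1.0, 0.0],
     [0.0, 0.0, 0.5, 1.0]]

d_hat, s_tilde = select_model(Y, P, sigma2=1.0)
print(d_hat, s_tilde)
```

The penalty depends on the rows of `P` and not only on the dimension `d`, which is the sense in which it is not linear in the dimension of the models.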
To study the performance of $\tilde{s}$, we aim to upper bound its quadratic risk. To this
end, we establish inequalities of the form

$$ E\left[\|s - \tilde{s}\|_n^2\right] \leq C \inf_{m \in \mathcal{M}} \left\{ \|s - s_m\|_n^2 + \mathrm{pen}(m) \right\} + \frac{R}{n} \tag{13} $$

where $C$ and $R$ are numerical terms that do not depend on $n$. Note that if the penalty is
proportional to $\mathrm{Tr}({}^tP_n \pi_m P_n)\sigma^2/n$, then the quantity involved in the infimum is of order of the
$L^2$-risk of $\hat{s}_m$. Consequently, under suitable assumptions, such inequalities allow us to deduce
upper bounds of order of the minimal risk among the collection of estimators $\{\hat{s}_m, m \in \mathcal{M}\}$.
This result is known as an oracle inequality

$$ E\left[\|s - \tilde{s}\|_n^2\right] \leq C\,\mathcal{R}(s, \mathcal{F}) = C \inf_{m \in \mathcal{M}} E\left[\|s - \hat{s}_m\|_n^2\right] . \tag{14} $$

This kind of procedure is not new and the first results in estimation by penalized criterion
are due to Akaike [1] and Mallows [25] in the early seventies. Since these works, model selection
has developed considerably and it would be beyond the scope of this paper to make
an exhaustive historical review of the domain. We refer to the first chapters of [28] for a more
general introduction.

The nonasymptotic model selection approach for estimating components in an additive model
has been studied in only a few papers. Considering penalties that are linear in the dimension of the
models, Baraud, Comte and Viennet [6] have obtained general results for geometrically
$\beta$-mixing regression models. Applying them to the particular case of additive models, they estimate
the whole regression function. They obtain nonasymptotic upper bounds similar to (13)
provided that $\varepsilon$ admits a moment of order larger than 6. For additive regression on a random
design and similar penalties, Baraud [5] proved oracle inequalities for estimators of the whole
regression function constructed with polynomial collections of models and a noise that admits
a moment of order 4. Recently, Brunel and Comte [13] have obtained results with the same
flavor for the estimation of the regression function in a censored additive model and a noise
admitting a moment of order larger than 8. Pursuant to this work, Brunel and Comte [14]
have also proposed a nonasymptotic iterative method to achieve the same goal. Combining
ideas from sparse linear modeling and additive regression, Ravikumar et al. [32] have recently
developed a data-driven procedure, called SpAM, for estimating a sparse high-dimensional
regression function. Some of their empirical results have been proved by Meier, van de Geer
and Bühlmann [29] in the case of a sub-Gaussian noise and some sparsity-smoothness penalty.

The methods that we use are similar to the ones of Baraud, Comte and Viennet and are
inspired by [4]. The main contribution of this paper is the generalization of the results of
[4] and [6] to the framework (8) with a known matrix $P_n$ under Gaussian hypothesis or only
moment condition on the noise terms. Taking into account the correlations between the
observations in the procedure leads us to deal with penalties that are not linear in the dimension
of the models. Such a consideration naturally arises in a heteroscedastic framework. Indeed,
as mentioned in [2], at least from an asymptotic point of view, considering penalties linear in
the dimension of the models in a heteroscedastic framework does not lead to oracle inequalities
for $\tilde{s}$. For our penalized procedure and under mild assumptions on $\mathcal{F}$, we prove oracle
inequalities under Gaussian hypothesis on the noise or only under some moment condition.

Moreover, we introduce a nonasymptotic procedure to estimate one component in an additive
framework. Indeed, the works cited above are all connected to the estimation of the whole
regression function by estimating simultaneously all of its components. Since these components
are each treated in the same way, their procedures cannot focus on the properties of
one of them. In the procedure that we propose, we can be sharper, from the point of view
of the bias term, by using more models to estimate a particular component. This allows us
to deduce uniform convergence rates over Hölderian balls and adaptivity of our estimators.
To the best of our knowledge, our results in nonasymptotic estimation of a nonparametric
component in an additive regression model are new.

The paper is organized as follows. In Section 2, we study the properties of the estimation
procedure under the hypotheses $(\mathrm{H}_{\mathrm{Gau}})$ and $(\mathrm{H}_{\mathrm{Mom}})$ with a known variance factor $\sigma^2$. As a
consequence, we deduce oracle inequalities and we discuss the size of the collection $\mathcal{F}$.
The case of unknown $\sigma^2$ is presented in Section 3 and the results of the previous section are
extended to this situation. In Section 4, we apply these results to the particular case of the
additive models and, in the next section, we give uniform convergence rates for our estimators
over Hölderian balls. Finally, in Section 6, we illustrate the performance of our estimators
in practice by a simulation study. The last sections are devoted to the proofs and to some
technical lemmas.

Notations: in the sequel, for any $x = (x_1, \dots, x_n)'$, $y = (y_1, \dots, y_n)' \in \mathbb{R}^n$, we define

$$ \|x\|_n^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 \quad \text{and} \quad \langle x, y \rangle_n = \frac{1}{n}\sum_{i=1}^{n} x_i y_i . $$

We denote by $\rho$ the spectral norm on the set $\mathbb{M}_n$ of the $n \times n$ real matrices as the norm induced
by $\|\cdot\|_n$,

$$ \forall A \in \mathbb{M}_n, \quad \rho(A) = \sup_{x \in \mathbb{R}^n \setminus \{0\}} \frac{\|Ax\|_n}{\|x\|_n} . $$

For more details about the properties of $\rho$, see Chapter 5 of [20].
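As a side remark that may help when evaluating $\rho$ in practice (this is standard linear algebra rather than a statement taken from the text above), the normalization by $1/n$ cancels in the ratio defining $\rho$, so $\rho$ coincides with the usual operator norm associated with the Euclidean norm $\|\cdot\|_2$:

```latex
\rho(A)
  = \sup_{x \neq 0} \frac{\sqrt{\tfrac{1}{n}\sum_{i=1}^{n} (Ax)_i^2}}{\sqrt{\tfrac{1}{n}\sum_{i=1}^{n} x_i^2}}
  = \sup_{x \neq 0} \frac{\|Ax\|_2}{\|x\|_2},
\qquad
\rho^2(A) = \lambda_{\max}\left({}^tA A\right) = \rho\left({}^tA A\right).
```

Here $\lambda_{\max}$ denotes the largest eigenvalue of a symmetric nonnegative matrix.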



2 Main results 

Throughout this section, we deal with the statistical framework given by (8) with $s \in \mathrm{Im}(P_n)$
and we assume that the variance factor $\sigma^2$ is known. Moreover, in the sequel of this paper,
for any $d \in \mathbb{N}$, we define $N_d$ as the number of models of dimension $d$ in $\mathcal{F}$,

$$ N_d = \mathrm{Card}\left\{ m \in \mathcal{M} : \dim(S_m) = d \right\} . $$

We first introduce general model selection theorems under hypotheses $(\mathrm{H}_{\mathrm{Gau}})$ and $(\mathrm{H}_{\mathrm{Mom}})$.

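Theorem 2.1. Assume that $(\mathrm{H}_{\mathrm{Gau}})$ holds and consider a collection of nonnegative numbers
$\{L_m, m \in \mathcal{M}\}$. Let $\theta > 0$, if the penalty function is such that

$$ \mathrm{pen}(m) \geq (1 + \theta + L_m)\frac{\mathrm{Tr}({}^tP_n \pi_m P_n)}{n}\sigma^2 \quad \text{for all } m \in \mathcal{M} , \tag{15} $$

then the penalized least-squares estimator $\tilde{s}$ given by (11) satisfies

$$ E\left[\|s - \tilde{s}\|_n^2\right] \leq C \inf_{m \in \mathcal{M}} \left\{ \|s - s_m\|_n^2 + \mathrm{pen}(m) - \frac{\mathrm{Tr}({}^tP_n \pi_m P_n)}{n}\sigma^2 \right\} + \frac{\rho^2(P_n)\sigma^2}{n} R_n(\theta) \tag{16} $$

where we have set

$$ R_n(\theta) = C' \sum_{m \in \mathcal{M}} \exp\left( -\frac{C'' L_m \mathrm{Tr}({}^tP_n \pi_m P_n)}{\rho^2(P_n)} \right) $$

and $C > 1$ and $C', C'' > 0$ are constants that only depend on $\theta$.

If the errors are not supposed to be Gaussian but only to satisfy the moment condition
$(\mathrm{H}_{\mathrm{Mom}})$, the following upper bound on the $q$-th moment of $\|s - \tilde{s}\|_n^2$ holds.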

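Theorem 2.2. Assume that $(\mathrm{H}_{\mathrm{Mom}})$ holds and take $q > 0$ such that $2(q + 1) < p$. Consider
$\theta > 0$ and some collection $\{L_m, m \in \mathcal{M}\}$ of positive weights. If the penalty function is such
that

$$ \mathrm{pen}(m) \geq (1 + \theta + L_m)\frac{\mathrm{Tr}({}^tP_n \pi_m P_n)}{n}\sigma^2 \quad \text{for all } m \in \mathcal{M} , \tag{17} $$

then the penalized least-squares estimator $\tilde{s}$ given by (11) satisfies

$$ E\left[\|s - \tilde{s}\|_n^{2q}\right]^{1/q} \leq C \inf_{m \in \mathcal{M}} \left\{ \|s - s_m\|_n^2 + \mathrm{pen}(m) \right\} + \frac{\rho^2(P_n)\sigma^2}{n} R_n(p, q, \theta)^{1/q} \tag{18} $$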
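where we have set $R_n(p, q, \theta)$ equal to

$$ C' \tau_p \left[ N_0 + \sum_{m \in \mathcal{M} : S_m \neq \{0\}} \left( 1 + \frac{\mathrm{Tr}({}^tP_n \pi_m P_n)}{\rho^2(\pi_m P_n)} \right) \left( \frac{L_m \mathrm{Tr}({}^tP_n \pi_m P_n)}{\rho^2(P_n)} \right)^{q - p/2} \right] $$

and $C = C(q, \theta)$, $C' = C'(p, q, \theta)$ are positive constants.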

The proofs of these theorems give explicit values for the constants $C$ that appear in the
upper bounds. In both cases, these constants go to infinity as $\theta$ tends to 0 or increases toward
infinity. In practice, it seems reasonable to choose $\theta$ neither close to 0 nor very large. Thus
this explosive behavior is not restrictive but we still have to choose a "good" $\theta$. The values for
$\theta$ suggested by the proofs are around unity but we make no claim of optimality. Indeed,
it is a hard problem to determine an optimal choice for $\theta$ from theoretical computations
since it could depend on all the parameters and on the choice of the collection of models. In
order to calibrate it in practice, several solutions are conceivable. We can use a simulation
study, deal with cross-validation or try to adapt the slope heuristics described in [11] to our
procedure.

For penalties of order of $\mathrm{Tr}({}^tP_n \pi_m P_n)\sigma^2/n$, Inequalities (16) and (18) are not far from
being oracle. Let us denote by $R_n$ the remainder term $R_n(\theta)$ or $R_n(p, q, \theta)$ according to
whether $(\mathrm{H}_{\mathrm{Gau}})$ or $(\mathrm{H}_{\mathrm{Mom}})$ holds. To deduce oracle inequalities from that, we need some
additional hypotheses such as the following ones:

$(A_1)$ there exists some universal constant $\zeta > 0$ such that

$$ \mathrm{pen}(m) \leq \zeta \frac{\mathrm{Tr}({}^tP_n \pi_m P_n)}{n}\sigma^2 , \quad \text{for all } m \in \mathcal{M} , $$

$(A_2)$ there exists some constant $R > 0$ such that

$$ \sup_{n \geq 1} R_n \leq R , $$

$(A_3)$ there exists some constant $\rho > 1$ such that

$$ \sup_{n \geq 1} \rho^2(P_n) \leq \rho^2 . $$

Thus, under the hypotheses of Theorem 2.1 and these three assumptions, we deduce from
(16) that

$$ E\left[\|s - \tilde{s}\|_n^2\right] \leq C \inf_{m \in \mathcal{M}} \left\{ \|s - s_m\|_n^2 + \frac{\mathrm{Tr}({}^tP_n \pi_m P_n)}{n}\sigma^2 \right\} + \frac{R\rho^2\sigma^2}{n} $$

where $C$ is a constant that does not depend on $s$, $\sigma^2$ and $n$. By Proposition 1.1, this inequality
corresponds to (14) up to some additive term. To derive a similar inequality from (18), we need
on top of that to assume that $p > 4$ in order to be able to take $q = 1$.

Assumption $(A_3)$ is subtle and strongly depends on the nature of $P_n$. The case of the oblique
projector that we use to estimate a component in an additive framework will be discussed in
Section 4. Let us replace it, for the moment, by the following one:

$(A_3')$ there exists $c \in (0, 1)$ that does not depend on $n$ such that

$$ c\rho^2(P_n)\dim(S_m) \leq \mathrm{Tr}({}^tP_n \pi_m P_n) . $$

By the properties of the norm $\rho$, note that $\mathrm{Tr}({}^tP_n \pi_m P_n)$ always admits an upper bound
with the same flavor

$$ \begin{aligned}
\mathrm{Tr}({}^tP_n \pi_m P_n) &= \mathrm{Tr}\left(\pi_m P_n\,{}^t(\pi_m P_n)\right) \\
&\leq \rho\left(\pi_m P_n\,{}^t(\pi_m P_n)\right) \mathrm{rk}\left(\pi_m P_n\,{}^t(\pi_m P_n)\right) \\
&\leq \rho^2(\pi_m P_n)\,\mathrm{rk}(\pi_m) \\
&\leq \rho^2(P_n)\dim(S_m) .
\end{aligned} $$

In all our results, the quantity $\mathrm{Tr}({}^tP_n \pi_m P_n)$ stands for a dimensional term relative to $S_m$.
Hypothesis $(A_3')$ formalizes that by assuming that its order is the dimension of the model $S_m$
up to the norm of the covariance matrix ${}^tP_n P_n$.

Let us now discuss the assumptions $(A_1)$ and $(A_2)$. They are connected and they
raise the question of the impact of the complexity of the collection $\mathcal{F}$ on the estimation procedure. Typically,
condition $(A_2)$ will be fulfilled under $(A_1)$ when $\mathcal{F}$ is not too "large", that is, when the
collection does not contain too many models with the same dimension. We illustrate this
phenomenon by the two following corollaries.

Corollary 2.1. Assume that $(\mathrm{H}_{\mathrm{Gau}})$ and $(A_3')$ hold and consider some finite $A \geq 0$ such
that

$$ \sup_{d \in \mathbb{N} : N_d > 0} \frac{\log N_d}{d} \leq A . \tag{19} $$

Let $L$, $\theta$ and $\omega$ be some positive numbers that satisfy

$$ L \geq \frac{2(1 + \theta)^3}{c\,\theta^2}(A + \omega) . $$

Then, the estimator $\tilde{s}$ obtained from (11) with penalty function given by

$$ \mathrm{pen}(m) = (1 + \theta + L)\frac{\mathrm{Tr}({}^tP_n \pi_m P_n)}{n}\sigma^2 $$

is such that

$$ E\left[\|s - \tilde{s}\|_n^2\right] \leq C \inf_{m \in \mathcal{M}} \left\{ \|s - s_m\|_n^2 + (L \vee 1)\frac{\mathrm{Tr}({}^tP_n \pi_m P_n) \vee c\rho^2(P_n)}{n}\sigma^2 \right\} $$

where $C > 1$ only depends on $\theta$, $\omega$ and $c$.

For errors that only satisfy the moment condition, we have the following similar result.

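Corollary 2.2. Assume that $(\mathrm{H}_{\mathrm{Mom}})$ and $(A_3')$ hold with $p > 6$ and let $A > 0$ and $\omega > 0$
such that

$$ N_0 \leq 1 \quad \text{and} \quad \sup_{d > 0 : N_d > 0} \frac{N_d}{(1 + d)^{p/2 - 3 - \omega}} \leq A . \tag{20} $$

Consider some positive numbers $L$, $\theta$ and $\omega'$ that satisfy

$$ L \geq \omega' A^{2/(p-2)} , $$

then, the estimator $\tilde{s}$ obtained from (11) with penalty function given by

$$ \mathrm{pen}(m) = (1 + \theta + L)\frac{\mathrm{Tr}({}^tP_n \pi_m P_n)}{n}\sigma^2 $$

is such that

$$ E\left[\|s - \tilde{s}\|_n^2\right] \leq C \inf_{m \in \mathcal{M}} \left\{ \|s - s_m\|_n^2 + (L \vee 1)\frac{\mathrm{Tr}({}^tP_n \pi_m P_n) \vee c\rho^2(P_n)}{n}\sigma^2 \right\} $$

where $C > 1$ only depends on $\theta$, $p$, $\omega$, $\omega'$ and $c$.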

Note that the assumption $(A_3')$ guarantees that $\mathrm{Tr}({}^tP_n \pi_m P_n)$ is not smaller than $c\rho^2(P_n)\dim(S_m)$
and, at least for the models with positive dimension, this implies $\mathrm{Tr}({}^tP_n \pi_m P_n) \geq c\rho^2(P_n)$.
Consequently, up to the factor $L$, the upper bounds of $E[\|s - \tilde{s}\|_n^2]$ given by Corollaries 2.1
and 2.2 are of order of the minimal risk $\mathcal{R}(s, \mathcal{F})$. To deduce oracle inequalities for $\tilde{s}$ from that,
$(A_1)$ needs to be fulfilled. In other terms, we need to be able to consider some $L$ independently
from the size $n$ of the data. It will be the case if the same is true for the bounds $A$.

Let us assume that the collection $\mathcal{F}$ is small in the sense that, for any $d \in \mathbb{N}$, the number
of models $N_d$ is bounded by some constant term that depends neither on $n$ nor on $d$. Typically,
collections of nested models satisfy that. In this case, we are free to take $L$ equal to some
universal constant. So, $(A_1)$ is true for $\zeta = 1 + \theta + L$ and oracle inequalities can be deduced
for $\tilde{s}$. Conversely, a large collection $\mathcal{F}$ is such that there are many models with the same
dimension. We consider that this situation happens, for example, when the order of $A$ is
$\log n$. In such a case, we need to choose $L$ of order $\log n$ too and the upper bounds on the risk
of $\tilde{s}$ become oracle-type inequalities up to some logarithmic factor. However, we know that in
some situations, this factor cannot be avoided, as in the complete variable selection problem
with Gaussian errors (see Chapter 4 of [27]).
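The two regimes can be illustrated by computing the smallest admissible $A$ in a condition of the form $\log N_d / d \leq A$. The sketch below is an illustration under stated assumptions: the supremum is restricted to dimensions $d \geq 1$ to avoid dividing by zero, the "large" collection is taken to be all subsets of $n$ variables (so that $N_d$ is a binomial coefficient), and the function names are hypothetical.

```python
from math import comb, log


def complexity_bound(counts):
    """Smallest A with log(N_d) / d <= A over dimensions d >= 1 such that N_d > 0.

    `counts` maps a dimension d to the number N_d of models of that dimension.
    """
    ratios = (log(N) / d for d, N in counts.items() if d >= 1 and N > 0)
    return max(ratios, default=0.0)


def nested(n):
    """Nested collection: exactly one model per dimension 0, ..., n."""
    return {d: 1 for d in range(n + 1)}


def complete_selection(n):
    """Complete variable selection: one model per subset of n variables."""
    return {d: comb(n, d) for d in range(n + 1)}


for n in (10, 100, 1000):
    print(n, complexity_bound(nested(n)), complexity_bound(complete_selection(n)), log(n))
```

For nested models the bound is zero whatever `n` is, whereas for complete variable selection it grows like $\log n$, since the number of subsets of size $d$ is at most $n^d$ with equality at $d = 1$.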

As a consequence, the same model selection procedure allows us to deduce oracle-type
inequalities under $(\mathrm{H}_{\mathrm{Gau}})$ and $(\mathrm{H}_{\mathrm{Mom}})$. Nevertheless, the assumption on $N_d$ in Corollary
2.2 is more restrictive than the one in Corollary 2.1. Indeed, to obtain an oracle inequality in
the Gaussian case, the quantity $N_d$ is limited by $e^{Ad}$ while the bound is only polynomial in
$d$ under the moment condition. Thus, the Gaussian assumption $(\mathrm{H}_{\mathrm{Gau}})$ allows us to obtain oracle
inequalities for more general collections of models.

3 Estimation when variance is unknown 

In contrast with Section 2, the variance factor $\sigma^2$ is here assumed to be unknown in (8). Since
the penalties given by Theorems 2.1 and 2.2 depend on $\sigma^2$, the procedure introduced in the
previous section is no longer available to the statisticians. Thus, we need to estimate $\sigma^2$
in order to replace it in the penalty functions. The results of this section give upper bounds
for the $L^2$-risk of the estimators $\tilde{s}$ constructed in such a way.

To estimate the variance factor, we use a residual least-squares estimator $\hat{\sigma}^2$ that we define
as follows. Let $V$ be some linear subspace of $\mathrm{Im}(P_n)$ such that

$$ \mathrm{Tr}({}^tP_n \pi P_n) \leq \mathrm{Tr}({}^tP_n P_n)/2 \tag{21} $$

where $\pi$ is the orthogonal projection onto $V$. We define

$$ \hat{\sigma}^2 = \frac{n\|Y - \pi Y\|_n^2}{\mathrm{Tr}({}^tP_n (I_n - \pi) P_n)} . \tag{22} $$
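A minimal sketch of this residual estimator is given below, assuming for illustration that $V$ is spanned by the first `dim_V` canonical basis vectors, so that $\pi$ keeps the first `dim_V` coordinates. The function name `variance_estimate`, the data `Y` and the matrix `P` are hypothetical.

```python
def variance_estimate(Y, P, dim_V):
    """Residual least-squares variance estimator with V = span(e_1, ..., e_dim_V).

    Numerator: n * ||Y - pi Y||_n^2, the sum of squares of the dropped coordinates.
    Denominator: Tr(tP (I - pi) P) = Tr(tP P) - Tr(tP pi P).
    """
    total = sum(p * p for row in P for p in row)          # Tr(tP P)
    kept = sum(p * p for row in P[:dim_V] for p in row)   # Tr(tP pi P)
    if kept > total / 2:
        raise ValueError("V is too large: Tr(tP pi P) must not exceed Tr(tP P) / 2")
    residual = sum(y * y for y in Y[dim_V:])
    return residual / (total - kept)


Y = [4.3, 1.6, 0.9, -0.4]      # illustrative observations
P = [[1.0, 0.0, 0.0, 0.0],
     [0.5, 1.0, 0.0, 0.0],
     [0.0, 0.5, 1.0, 0.0],
     [0.0, 0.0, 0.5, 1.0]]

sigma2_hat = variance_estimate(Y, P, dim_V=2)
print(sigma2_hat)
```

Since $E[n\|Y - \pi Y\|_n^2] = n\|s - \pi s\|_n^2 + \sigma^2\,\mathrm{Tr}({}^tP_n (I_n - \pi) P_n)$, the estimator is unbiased only when $s \in V$ and is biased upward otherwise.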

First, we assume that the errors are Gaussian. The following result holds. 

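Theorem 3.1. Assume that $(\mathrm{H}_{\mathrm{Gau}})$ holds. For any $\theta > 0$, we define the penalty function

$$ \forall m \in \mathcal{M}, \quad \mathrm{pen}(m) = (1 + \theta)\frac{\mathrm{Tr}({}^tP_n \pi_m P_n)}{n}\hat{\sigma}^2 . \tag{23} $$

Then, for some positive constants $C$, $C'$ and $C''$ that only depend on $\theta$, the penalized
least-squares estimator $\tilde{s}$ satisfies

$$ E\left[\|s - \tilde{s}\|_n^2\right] \leq C \left( \inf_{m \in \mathcal{M}} E\left[\|s - \hat{s}_m\|_n^2\right] + \|s - \pi s\|_n^2 \right) + \frac{\rho^2(P_n)\sigma^2}{n}\bar{R}_n(\theta) \tag{24} $$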

where we have set 

|2 



Rn{0) = C 



2+ 



P 2 (PuW 



( 9 2 Tr( t P n P n )\ ^ / „ Tr{ ^n^mPn) \ 

exp {-^mr) + I> P v c 7m J 



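If the errors are only assumed to satisfy a moment condition, we have the following theorem.

Theorem 3.2. Assume that $(\mathrm{H}_{\mathrm{Mom}})$ holds. Let $\theta > 0$, we consider the penalty function
defined by

$$ \forall m \in \mathcal{M}, \quad \mathrm{pen}(m) = (1 + \theta)\frac{\mathrm{Tr}({}^tP_n \pi_m P_n)}{n}\hat{\sigma}^2 . \tag{25} $$

For any $0 < q \leq 1$ such that $2(q + 1) < p$, the penalized least-squares estimator $\tilde{s}$ satisfies

$$ E\left[\|s - \tilde{s}\|_n^{2q}\right]^{1/q} \leq C \left( \inf_{m \in \mathcal{M}} E\left[\|s - \hat{s}_m\|_n^2\right] + 2\|s - \pi s\|_n^2 \right) + \rho^2(P_n)\sigma^2 \bar{R}_n(p, q, \theta) $$

where $C = C(q, \theta)$ and $C' = C'(p, q, \theta)$ are positive constants, $\bar{R}_n(p, q, \theta)$ is equal to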

+ p Kn \p 2 (P n )cT 2+Tp ){Tr^P n P n r P ) 

with $R_n(p, q, \theta)$ defined as in Theorem 2.2, $(\kappa_n)_{n \in \mathbb{N}} = (\kappa_n(p, q, \theta))_{n \in \mathbb{N}}$ is a sequence of positive
numbers that tends to $\kappa = \kappa(p, q, \theta) > 0$ as $\mathrm{Tr}({}^tP_n P_n)/\rho^2(P_n)$ increases toward infinity and

$$ \alpha_p = (p/2 - 1) \vee 1 \quad \text{and} \quad \beta_p = (p/2 - 1) \wedge 1 . $$

Penalties given by (23) and (25) are random and allow us to construct estimators $\tilde{s}$ when
$\sigma^2$ is unknown. This approach leads to theoretical upper bounds for the risk of $\tilde{s}$. Note that
we use some generic model $V$ to construct $\hat{\sigma}^2$. This space is quite arbitrary and is essentially
constrained by (21) to be at most a half-space of $\mathrm{Im}(P_n)$. The idea is that taking $V$ as some "large" space
can lead to a good approximation of the true $s$ and, thus, $Y - \pi Y$ is not far from being
centered and its normalized norm is of order $\sigma^2$. However, in practice, it is known that the
estimator $\hat{\sigma}^2$ tends to overestimate the true value of $\sigma^2$, as illustrated by Lemmas 8.4 and
8.5. Consequently, the penalty function tends to be larger and the procedure overpenalizes
models with high dimension. To offset this phenomenon, a practical solution could be to
choose some smaller $\theta$ when $\sigma^2$ is unknown than when it is known, as we discuss in Section 6.

4 Application to additive models 

In this section, we focus on the framework (4) given by an additive model. To describe the
procedure to estimate the component $s$, we assume that the variance factor $\sigma^2$ is known but
it can be easily generalized to the unknown factor case by considering the results of Section
3. We recall that $s \in L_0^2([0, 1], \nu)$, $t^j \in L_0^2([0, 1], \nu_j)$, $j = 1, \dots, K$, and we observe

$$ Z_i = s_i + t_i + \sigma\varepsilon_i, \quad i = 1, \dots, n , \tag{26} $$

where the random vector $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)'$ is such that $(\mathrm{H}_{\mathrm{Gau}})$ or $(\mathrm{H}_{\mathrm{Mom}})$ holds and the
vectors $s = (s_1, \dots, s_n)'$ and $t = (t_1, \dots, t_n)'$ are defined in (7).

Let $S_n$ be a linear subspace of $L_0^2([0, 1], \nu)$ and, for all $j \in \{1, \dots, K\}$, $S_n^j$ be a linear
subspace of $L_0^2([0, 1], \nu_j)$. We assume that these spaces have finite dimensions $D_n = \dim(S_n)$
and $D_n^{(j)} = \dim(S_n^j)$ such that

$$ D_n + D_n^{(1)} + \dots + D_n^{(K)} < n . $$

We consider an orthonormal basis $\{\phi_1, \dots, \phi_{D_n}\}$ (resp. $\{\psi_1^{(j)}, \dots, \psi_{D_n^{(j)}}^{(j)}\}$) of $S_n$ (resp. $S_n^j$)
equipped with the usual scalar product of $L^2([0, 1], \nu)$ (resp. of $L^2([0, 1], \nu_j)$). The linear
spans $E, F^1, \dots, F^K \subset \mathbb{R}^n$ are defined by

$$ E = \mathrm{Span}\left\{ (\phi_i(x_1), \dots, \phi_i(x_n))', \ i = 1, \dots, D_n \right\} $$

and

$$ F^j = \mathrm{Span}\left\{ (\psi_i^{(j)}(y_1^j), \dots, \psi_i^{(j)}(y_n^j))', \ i = 1, \dots, D_n^{(j)} \right\}, \quad j = 1, \dots, K . $$

Let $\mathbb{1}_n = (1, \dots, 1)' \in \mathbb{R}^n$, we also define

$$ F = \mathbb{R}\mathbb{1}_n + F^1 + \dots + F^K $$

where $\mathbb{R}\mathbb{1}_n$ is added to the $F^j$'s in order to take into account the constant part $\mu$ of (7).
Furthermore, note that the sum defining the space $F$ does not need to be direct.

We are free to choose the functions $\phi_i$'s and $\psi_i^{(j)}$'s. In the sequel, we assume that these functions are chosen in such a way that the mild assumption $E \cap F = \{0\}$ is fulfilled. Note that we do not assume that $s$ belongs to $E$ nor that $t$ belongs to $F$. Let $G$ be the space $(E+F)^\perp$, we obviously have $E \oplus F \oplus G = \mathbb{R}^n$ and we denote by $P_n$ the projection onto $E$ along $F + G$. Moreover, we define $\pi_E$ and $\pi_{F+G}$ as the orthogonal projections onto $E$ and $F+G$ respectively. Thus, we derive the following framework from (26),

$$Y = P_nZ = \bar s + \sigma P_n\varepsilon \tag{27}$$

where we have set

$$\bar s = P_ns + P_nt = s + (P_n - I_n)s + P_nt = s + (P_n - I_n)(s - \pi_Es) + P_n(t - \pi_{F+G}t) = s + h .$$

The last decomposition uses the fact that $P_n$ is the identity on $E$ and vanishes on $F+G$, so that $(P_n - I_n)\pi_Es = 0$ and $P_n\pi_{F+G}t = 0$.
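A small numerical sketch may help to fix ideas about the oblique projector $P_n$ and its spectral norm $\rho(P_n)$. It is only an illustration under assumptions that are not part of the text: the helper name `oblique_projector`, the use of a pseudo-inverse, the numerical tolerances and the random matrices standing in for the spanning vectors of $E$ and $F$ are all hypothetical choices.

```python
import numpy as np


def oblique_projector(B_E, B_F):
    """Matrix of the projection onto E = span(B_E) along F + G, where
    F = span(B_F) and G is the orthogonal complement of E + F.

    Hypothetical helper: the columns of B_E must be independent and E must
    meet F only at 0; the columns of B_F may be redundant, since the sum
    defining F does not need to be direct.
    """
    d_E = B_E.shape[1]
    M = np.hstack([B_E, B_F])
    rank_M = np.linalg.matrix_rank(M, tol=1e-8)
    rank_F = np.linalg.matrix_rank(B_F, tol=1e-8)
    if rank_M != d_E + rank_F:
        raise ValueError("need independent columns in B_E and E meeting F only at 0")
    # pinv(M) @ z gives least-squares coordinates of z on (B_E, B_F): the
    # residual lies in G and, under the check above, the coordinates on B_E
    # are uniquely determined.
    coords = np.linalg.pinv(M, rcond=1e-10)
    return B_E @ coords[:d_E, :]


rng = np.random.default_rng(0)
n = 40
B_E = rng.normal(size=(n, 3))                                # stands in for the vectors spanning E
B_F = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 5))])  # 1_n and the vectors spanning F
P = oblique_projector(B_E, B_F)
rho = np.linalg.norm(P, 2)  # spectral norm of the projector
```

With such a matrix, the projected data of (27) would simply be `Y = P @ Z`, and `rho` is the quantity that can be monitored in practice.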

Let $\mathcal{F} = \{S_m, m\in\mathcal{M}\}$ be a finite collection of linear subspaces of $E$, we apply the procedure described in Section 2 to $Y$ given by (27), that is, we choose an index $\hat m\in\mathcal{M}$ as a minimizer of (11) with a penalty function satisfying the hypotheses of Theorems 2.1 or 2.2 according to whether $(H_{\mathrm{Gau}})$ or $(H_{\mathrm{Mom}})$ holds. This way, we estimate $s$ by $\tilde s$. From the triangular inequality, we derive that

$$\mathbb{E}\left[\|s-\tilde s\|_n^2\right] \le 2\,\mathbb{E}\left[\|\bar s-\tilde s\|_n^2\right] + 2\|h\|_n^2 .$$

As we discussed previously, under suitable assumptions on the complexity of the collection $\mathcal{F}$, we can assume that (A1) and (A2) are fulfilled. Let us suppose for the moment that (A3) is satisfied for some $\rho > 1$. Note that, for any $m\in\mathcal{M}$, $\pi_m$ is an orthogonal projection onto a subspace of the image set of the oblique projection $P_n$. Consequently, we have $\mathrm{Tr}({}^tP_n\pi_mP_n) \ge \mathrm{rk}(\pi_m) = \dim(S_m)$ and Assumption (A3) implies (A 3 ) with $c = 1/\rho^2$. Since, for all $m\in\mathcal{M}$,

$$\|\bar s - \pi_m\bar s\|_n \le \|\bar s - \pi_ms\|_n \le \|s - \pi_ms\|_n + \|h\|_n ,$$

we deduce from Theorems 2.1 or 2.2 that we can find, independently from $s$ and $n$, two positive numbers $C$ and $C'$ such that

E[\\s-i\\t\ZC M \\\s - ir m s\\l + T < tp ^ p n) a 2\ + c , L hf + fW R \ (28) 

To derive an interesting upper bound on the $L^2$-risk of $\tilde s$, we need to control the remainder term. Because $\rho(\cdot)$ is a norm on $\mathbb{M}_n$, we dominate the norm of $h$ by

$$\begin{aligned}
\|h\|_n &\le \rho(I_n - P_n)\|s - \pi_Es\|_n + \rho(P_n)\|t - \pi_{F+G}t\|_n\\
&\le (1+\rho(P_n))\left(\|s - \pi_Es\|_n + \|t - \pi_{F+G}t\|_n\right)\\
&\le (1+\rho)\left(\|s - \pi_Es\|_n + \|t - \pi_{F+G}t\|_n\right) .
\end{aligned}$$

Note that, for any $m\in\mathcal{M}$, $S_m\subset E$ and so, $\|s - \pi_Es\|_n \le \|s - \pi_ms\|_n$. Thus, Inequality (28) leads to

E[\\s-S\\l] < C{l+pf inf (||, - n m s\\ 2 n + T ^ tPn7TmPn) a 2 )+C'(l+p) 2 (\\t - n F+G t\\ 2 n + £r) 
mi .V! [ n J \ n J 

(29) 

The space $F+G$ has to be seen as a large approximation space. So, under a reasonable assumption on the regularity of the component $t$, the quantity $\|t - \pi_{F+G}t\|_n$ could be regarded as negligible. It mainly remains to understand the order of the multiplicative factor $(1+\rho)^2$.

Thus, we now discuss the norm $\rho(P_n)$ and the assumption (A3). This quantity depends on the design points $(x_i, y_i^1,\dots,y_i^K) \in [0,1]^{K+1}$ and on how we construct the spaces $E$ and $F$, i.e. on the choice of the basis functions $\phi_i$ and $\psi_i^{(j)}$. Hereafter, the design points $(x_i, y_i^1,\dots,y_i^K)$ will be assumed to be known independent realizations of a random variable on $[0,1]^{K+1}$ with distribution $\nu\otimes\nu_1\otimes\dots\otimes\nu_K$. We also assume that these points are independent of the noise $\varepsilon$ and we proceed conditionally on them. To discuss the probability for (A3) to occur, we introduce some notation. We denote by $D'_n$ the integer

$$D'_n = 1 + D_n^{(1)} + \dots + D_n^{(K)}$$

and we have $\dim(F) \le D'_n$. Let $A$ be a $p\times p$ real matrix, we define

$$r_p(A) = \sup\left\{\sum_{i=1}^p\sum_{j=1}^p |a_i||a_j||A_{ij}|\ :\ \sum_{i=1}^p a_i^2 \le 1\right\} .$$

Moreover, we define the matrices $V(\phi)$ and $B(\phi)$ by

$$V_{ij}(\phi) = \sqrt{\int_0^1\phi_i(x)^2\phi_j(x)^2\,\nu(dx)}\quad\text{and}\quad B_{ij}(\phi) = \sup_{x\in[0,1]}|\phi_i(x)\phi_j(x)| ,$$

for any $1\le i,j\le D_n$. Finally, we introduce the quantities

$$L_\phi = \max\left\{r^2_{D_n}(V(\phi)),\ r_{D_n}(B(\phi))\right\}\quad\text{and}\quad b_\phi = \max_{i=1,\dots,D_n}\ \sup_{x\in[0,1]}|\phi_i(x)|$$

and

$$L_n = \max\left\{L_\phi,\ D_nD'_n,\ b_\phi\sqrt{nD_nD'_n}\right\} .$$
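The supremum defining the matrix functional above (written $r_p(A)$ here) can be evaluated numerically. The sketch below is not taken from the text: it relies on the observation that the supremum only involves the entrywise absolute values $|A_{ij}|$ and nonnegative vectors $(|a_i|)$, so that, by the Perron-Frobenius theorem, it equals the largest eigenvalue of the symmetrized matrix of absolute values. The function name `r_sup` is hypothetical.

```python
import numpy as np


def r_sup(A):
    """sup of sum_{i,j} |a_i| |a_j| |A_ij| over vectors a with sum_i a_i^2 <= 1.

    Hypothetical helper: the quadratic form only sees the symmetric part of
    |A|, which is a nonnegative symmetric matrix; its top eigenvector can be
    taken with nonnegative entries, so the supremum is its largest eigenvalue.
    """
    S = np.abs(np.asarray(A, dtype=float))
    S = 0.5 * (S + S.T)
    return float(np.linalg.eigvalsh(S)[-1])  # eigvalsh sorts in ascending order


A = np.array([[1.0, -2.0], [0.0, 3.0]])
value = r_sup(A)  # largest eigenvalue of [[1, 1], [1, 3]], i.e. 2 + sqrt(2)
```

Applied to the matrices $V(\phi)$ and $B(\phi)$, such a routine would give the two terms entering $L_\phi$ once these matrices have been computed for the chosen basis.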



Proposition 4.1. Consider the matrix $P_n$ defined in (27). We assume that the design points are independent realizations of a random variable on $[0,1]^{K+1}$ with distribution $\nu\otimes\nu_1\otimes\dots\otimes\nu_K$ such that we have $E\cap F = \{0\}$ and $\dim(E) = D_n$ almost surely. If the basis $\{\phi_1,\dots,\phi_{D_n}\}$ is such that

$$\forall\, 1\le i\le D_n,\quad \int_0^1\phi_i(x)\,\nu(dx) = 0 \tag{30}$$

then, there exists some universal constant $C > 0$ such that, for any $\rho > 1$,

$$\mathbb{P}\left(\rho(P_n) > \rho\right) \le 4D_n(D_n + D'_n)\exp\left(-\frac{Cn}{L_n}\left(1-\rho^{-1}\right)^2\right) .$$

As a consequence of Proposition 4.1, we see that (A3) is fulfilled with a large probability provided that we choose basis functions $\phi_i$ in such a way as to keep $L_n$ small compared with $n$. It will be so for localized bases (piecewise polynomials, orthonormal wavelets, ...) with $L_n$ of order of $n^{1-\omega}$, for some $\omega\in(0,1)$, once we consider $D_n$ and $D'_n$ of order of n3~T (this is a direct consequence of Lemma 1 in [8]). This limitation, mainly due to the generality of the proposition, could seem restrictive from a practical point of view. However, the statistician can explicitly compute $\rho(P_n)$ with the data. Thus, it is possible to adjust $D_n$ and $D'_n$ in order to keep $\rho(P_n)$ small in practice. Moreover, we will see in Section 6 that, for our choices of $\phi_i$ and $\psi_i^{(j)}$, we can easily consider $D_n$ and $D'_n$ of order of $\sqrt n$ while keeping $\rho(P_n)$ small (concrete values are given in the simulation study).

5 Convergence rates

The previous sections have introduced various upper bounds on the $L^2$-risk of the penalized least-squares estimators $\tilde s$. Each of them is connected to the minimal risk of the estimators among a collection $\{\hat s_m, m\in\mathcal{M}\}$. One of the main advantages of such inequalities is that they allow us to derive uniform convergence rates with respect to many well-known classes of smoothness (see [7]). In this section, we give such results over Hölderian balls for the estimation of a component in an additive framework. To this end, for any $\alpha > 0$ and $R > 0$, we introduce the space $\mathcal{H}_\alpha(R)$ of the $\alpha$-Hölderian functions with constant $R$ on $[0,1]$,

$$\mathcal{H}_\alpha(R) = \left\{f : [0,1]\to\mathbb{R}\ :\ \forall x,y\in[0,1],\ |f(x)-f(y)| \le R|x-y|^\alpha\right\} .$$

In order to derive such convergence rates, we need a collection of models $\mathcal{F}$ with good approximation properties for the functions of $\mathcal{H}_\alpha(R)$. We denote by $P_n^{BM}$ any oblique projector defined as in the previous section and based on spaces $S_n$ and $S_n^j$ that are constructed as one of the examples given in Section 2 of [9]. In particular, such a construction allows us to deal with approximation spaces $S_n$ and $S_n^j$ that can be considered as spaces of piecewise polynomials, spaces of orthogonal wavelet expansions or spaces of dyadic splines on $[0,1]$. We consider the dimensions $D_n = \dim(S_n)$ and, for any $j\in\{1,\dots,K\}$, $D_n^{(j)} = \dim(S_n^j) = D_n/K$. Finally, we take a collection of models $\mathcal{F}^{BM}$ that contains subspaces of $E = \mathrm{Im}(P_n^{BM})$ as Baraud did in Section 2.2 of [8].

Proposition 5.1. Consider the framework (5) and assume that $(H_{\mathrm{Gau}})$ or $(H_{\mathrm{Mom}})$ holds with $p > 6$. We define $Y$ in (27) with $P_n^{BM}$. Let $\eta > 0$ and $\tilde s$ be the estimator selected by the procedure (11) applied to the collection of models $\mathcal{F}^{BM}$ with the penalty

$$\mathrm{pen}(m) = (1+\eta)\frac{\mathrm{Tr}({}^tP_n^{BM}\pi_mP_n^{BM})}{n}\sigma^2 .$$

Suppose that (A3) is fulfilled, we define 

For any a > Q n and R > 0, the penalized least-squares estimator s satisfies 

$$\sup_{(s,t^1,\dots,t^K)\in\mathcal{H}_\alpha(R)^{K+1}}\mathbb{E}_{\varepsilon,d}\left[\|s-\tilde s\|_n^2\right] \le C_\alpha\, n^{-2\alpha/(2\alpha+1)} \tag{31}$$

where $\mathbb{E}_{\varepsilon,d}$ is the expectation on $\varepsilon$ and on the random design points and $C_\alpha > 1$ only depends on $\alpha$, $\rho$, $\sigma^2$, $K$, $L$, $\theta$ and $p$ (under $(H_{\mathrm{Mom}})$ only).

Note that the supremum is taken over Hölderian balls for all the components of the regression function, i.e. the regression function is itself supposed to belong to a Hölderian space. As we mentioned in the introduction, Stone [37] has proved that the rate of convergence given by (31) is optimal in the minimax sense.



6 Simulation study 

In this section, we study simulations based on the framework given by (5) with $K+1$ components $s, t^1,\dots,t^K$ and Gaussian errors. First, we introduce the spaces $S_n$ and $S_n^j$, $j\in\{1,\dots,K\}$, and the collections of models that we handle. Next, we illustrate the performances of the estimators in practice by several examples.

6.1 Preliminaries

To perform the simulation study, we consider two collections of models. In both cases, we deal with the same spaces $S_n$ and $S_n^j$ defined as follows. Let $\varphi$ be the Haar wavelet's mother function,

$$\varphi(x) = \begin{cases} 1 & \text{if } 0\le x<1/2,\\ -1 & \text{if } 1/2\le x<1,\\ 0 & \text{otherwise.}\end{cases}$$

For any $i\in\mathbb{N}$ and $j\in\{0,\dots,2^i-1\}$, we introduce the functions

$$\varphi_{i,j}(x) = 2^{i/2}\varphi(2^ix - j) .$$

It is clear that these functions are orthonormal in $L^2_0([0,1],dx)$ for the usual scalar product. Let $d_n$ be some positive integer, we consider the space $S_n\subset L^2_0([0,1],dx)$ generated by the functions $\varphi_{i,j}$ such that $0\le i\le d_n$ and $0\le j<2^i$. The dimension of this space is $\dim(S_n) = D_n = 2^{d_n+1}-1$. In the sequel, we denote by $\Pi_n$ the set of all the allowed pairs $(i,j)$,

$$\Pi_n = \left\{(i,j)\in\mathbb{N}^2 \text{ such that } 0\le i\le d_n,\ 0\le j<2^i\right\} .$$

Moreover, for any $k\in\{1,\dots,D_n\}$ such that $k = 2^i + j$ with $(i,j)\in\Pi_n$, we denote $\phi_k = \varphi_{i,j}$.
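The construction of the Haar family and the index map $k = 2^i + j$ can be checked numerically. The following sketch is only illustrative: the function names `haar_mother`, `haar` and `haar_design` are hypothetical, and the orthonormality check uses a midpoint rule on a dyadic grid (an assumption made here so that the integrals are computed exactly for piecewise constant functions), not the random design of the simulations.

```python
import numpy as np


def haar_mother(x):
    """Haar mother function: 1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
    x = np.asarray(x, dtype=float)
    return np.where((0.0 <= x) & (x < 0.5), 1.0,
                    np.where((0.5 <= x) & (x < 1.0), -1.0, 0.0))


def haar(i, j, x):
    """Dilated and translated function 2^{i/2} * mother(2^i x - j)."""
    return 2.0 ** (i / 2.0) * haar_mother(2.0 ** i * np.asarray(x, dtype=float) - j)


def haar_design(x, d):
    """Allowed pairs (i, j) with 0 <= i <= d, 0 <= j < 2^i, and the matrix
    whose columns are the corresponding functions evaluated at the points x."""
    pairs = [(i, j) for i in range(d + 1) for j in range(2 ** i)]
    columns = [haar(i, j, x) for i, j in pairs]
    return pairs, np.column_stack(columns)


d = 3
N = 2 ** (d + 3)
grid = (np.arange(N) + 0.5) / N           # midpoints of a dyadic grid on [0, 1]
pairs, Phi = haar_design(grid, d)
gram = Phi.T @ Phi / N                    # midpoint-rule approximation of the L2 products
indices = sorted(2 ** i + j for i, j in pairs)
```

On such a grid, `gram` is the identity matrix of size $2^{d+1}-1$ and `indices` enumerates $1,\dots,2^{d+1}-1$, in line with the dimension and the indexing stated above. At random design points, the columns of the evaluated matrix are of course no longer exactly orthogonal.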
Let $d'_n$ be another positive integer, the spaces $S_n^j\subset L^2_0([0,1],dy)$ are all supposed to be generated by the functions defined on $[0,1]$ by

$$\psi_{2i}^{(j)}(y) = \psi_{2i}(y) = \sin(i\pi y)\quad\text{and}\quad \psi_{2i-1}^{(j)}(y) = \psi_{2i-1}(y) = \cos(i\pi y)$$

for any $i\in\{1,\dots,d'_n\}$ and $j\in\{1,\dots,K\}$. Thus, we have $\dim(S_n^j) = D_n^{(j)} = 2d'_n$ and $D'_n = 2Kd'_n + 1$.

As previously, we define $P_n$ as the oblique projector onto $E$ along $F + (E+F)^\perp$. The image set $E = \mathrm{Im}(P_n)$ is generated by the vectors

$$\varphi_{i,j} = (\varphi_{i,j}(x_1),\dots,\varphi_{i,j}(x_n))'\in\mathbb{R}^n,\quad (i,j)\in\Pi_n .$$

Let $m$ be a subset of $\Pi_n$, the model $S_m$ is defined as the linear subspace of $E$ generated by the vectors $\varphi_{i,j}$ with $(i,j)\in m$.

In the following simulations, we always take $D_n$ and $D'_n$ close to $4\sqrt n$, i.e.

$$d_n = \left\lfloor\frac{\log(2\sqrt n + 1/2)}{\log(2)}\right\rfloor\quad\text{and}\quad d'_n = \left\lfloor\frac{4\sqrt n - 1}{2K}\right\rfloor$$

where, for any $x\in\mathbb{R}$, $\lfloor x\rfloor$ denotes the largest integer not greater than $x$. For such choices, basic computations lead to $L_n$ of order of $n^{5/4}$ in Proposition 4.1. As a consequence, this proposition does not ensure that (A3) is fulfilled with a large probability. However, $\rho(P_n)$ remains small in practice, as we will see, and it allows us to deal with larger collections of models.
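The choice of the resolution levels can be made concrete with a few lines of code. This is a minimal sketch, assuming the relations $D_n = 2^{d_n+1}-1$ and $D'_n = 2Kd'_n+1$ given above and the target $4\sqrt n$ for both dimensions; the function name `simulation_dimensions` is hypothetical.

```python
import math


def simulation_dimensions(n, K):
    """Levels d_n, d'_n and dimensions D_n, D'_n, both targeted at 4 * sqrt(n)."""
    d_n = math.floor(math.log(2.0 * math.sqrt(n) + 0.5) / math.log(2.0))
    d_prime_n = math.floor((4.0 * math.sqrt(n) - 1.0) / (2.0 * K))
    D_n = 2 ** (d_n + 1) - 1          # dimension of the Haar space S_n
    D_prime_n = 2 * K * d_prime_n + 1  # constant part plus K trigonometric spaces
    return d_n, D_n, d_prime_n, D_prime_n


dims = simulation_dimensions(512, 6)  # (5, 63, 7, 85) since 4 * sqrt(512) is about 90.5
```

For $n = 512$ and $K = 6$, the floor operations give $D_n = 63$ and $D'_n = 85$, both below the target value of about $90.5$.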

6.2 Collections of models 

The first collection of models is the smaller one because the models are nested. Let us introduce the index subsets, for any $i\in\{0,\dots,d_n\}$,

$$m_i = \left\{(i,j),\ 0\le j<2^i\right\}\subset\Pi_n .$$

Thus, we define $\mathcal{F}^N$ as

$$\mathcal{F}^N = \left\{S_m \text{ such that } \exists k\in\{0,\dots,d_n\},\ m = \bigcup_{i=0}^k m_i\right\} .$$

This collection has a small complexity since, for any $d\in\mathbb{N}$, $N_d\le 1$. According to Corollary 2.1, we can consider the penalty function given by

$$\mathrm{pen}_N(m) = (1+C)\frac{\mathrm{Tr}({}^tP_n\pi_mP_n)}{n}\sigma^2 \tag{32}$$

for some $C > 0$. In order to compute the selected estimator $\tilde s$, we simply compute $\hat s_m$ in each model of $\mathcal{F}^N$ and we take the one that minimizes the penalized least-squares criterion.
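The selection among nested models can be sketched as a simple loop over the models. The code below is an illustration under stated assumptions, not the implementation used for the simulations: the function name `select_nested` is hypothetical, each model is passed as a matrix whose columns span it, a QR factorization is used to obtain the orthogonal projection, and the criterion is the normalized residual sum of squares plus a penalty of the form $(1+C)\,\mathrm{Tr}({}^tP_n\pi_mP_n)\sigma^2/n$.

```python
import numpy as np


def select_nested(Y, bases, P, sigma2, C):
    """Return the index and the fitted vector of the model minimizing the
    penalized least-squares criterion over a list of (nested) models."""
    n = Y.shape[0]
    best = None
    for k, B in enumerate(bases):
        Q, _ = np.linalg.qr(B)               # orthonormal basis of the model S_m
        fit = Q @ (Q.T @ Y)                  # least-squares estimator in S_m
        trace = np.sum((Q.T @ P) ** 2)       # Tr(tP pi_m P) = squared Frobenius norm of pi_m P
        crit = np.sum((Y - fit) ** 2) / n + (1.0 + C) * trace * sigma2 / n
        if best is None or crit < best[0]:
            best = (crit, k, fit)
    return best[1], best[2]
```

A typical call would be `k_hat, s_tilde = select_nested(Y, bases, P, sigma2, 1.5)`, where `bases[k]` holds the vectors of the $k$-th nested model and `P` is the oblique projector; the value `1.5` for the constant is only an example.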

The second collection of models is larger than $\mathcal{F}^N$. Indeed, we allow $m$ to be any subset of $\Pi_n$ and we introduce

$$\mathcal{F}^C = \left\{S_m \text{ such that } m\subset\Pi_n\right\} .$$

The complexity of this collection is large because, for any $d\in\mathbb{N}$,

$$N_d = \binom{D_n}{d} = \frac{D_n!}{d!\,(D_n-d)!} \le \left(\frac{eD_n}{d}\right)^d .$$

So, we have $\log N_d \le d(1+\log D_n)$ and, according to Corollary 2.1, we take a penalty function as

$$\mathrm{pen}_C(m) = (1 + C + \log D_n)\frac{\mathrm{Tr}({}^tP_n\pi_mP_n)}{n}\sigma^2 \tag{33}$$

for some $C > 0$. The large number of models in $\mathcal{F}^C$ leads to difficulties for computing the estimator $\tilde s$. Instead of exploring all the models among $\mathcal{F}^C$, we break the penalized criterion down with respect to an orthonormal basis $\phi_1,\dots,\phi_{D_n}$ of $E$ and we get



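$$\left\|Y - \sum_{i\in m}\langle Y,\phi_i\rangle_n\phi_i\right\|_n^2 + (1+C+\log D_n)\frac{\mathrm{Tr}({}^tP_n\pi_mP_n)}{n}\sigma^2 = \|Y\|_n^2 - \sum_{i\in m}\left[\langle Y,\phi_i\rangle_n^2 - (1+C+\log D_n)\|{}^tP_n\phi_i\|_n^2\frac{\sigma^2}{n}\right]$$

where $m$ indexes the model generated by the vectors $\phi_i$, $i\in m$, and the basis is orthonormal for $\langle\cdot,\cdot\rangle_n$. In order to minimize the penalized least-squares criterion, we only need to keep the coefficients $\langle Y,\phi_i\rangle_n$ that are such that

$$\langle Y,\phi_i\rangle_n^2 \ge (1+C+\log D_n)\|{}^tP_n\phi_i\|_n^2\frac{\sigma^2}{n} .$$

This threshold procedure allows us to compute the estimator $\tilde s$ in reasonable time.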
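The threshold rule can be sketched in a few lines and compared with an exhaustive search on a small example. The code is illustrative only and rests on assumptions made here: the columns of `Phi` are orthonormal for the normalized product $\langle u,v\rangle_n = u'v/n$, the models are spans of subsets of these columns, norms are normalized by $n$, and the names `threshold_select` and `criterion` are hypothetical.

```python
import numpy as np


def threshold_select(Y, Phi, P, sigma2, C):
    """Keep the coefficients whose square exceeds the penalty they cost.

    Phi: n x D matrix with Phi.T @ Phi = n * identity (assumption).
    P:   n x n matrix playing the role of the oblique projector.
    """
    n, D = Phi.shape
    coef = Phi.T @ Y / n                        # <Y, phi_i>_n
    w = np.sum((P.T @ Phi) ** 2, axis=0) / n    # ||tP phi_i||_n^2
    keep = coef ** 2 >= (1.0 + C + np.log(D)) * w * sigma2 / n
    return keep, Phi[:, keep] @ coef[keep]


def criterion(Y, Phi, P, sigma2, C, subset):
    """Penalized least-squares criterion of the model spanned by Phi[:, subset]."""
    n, D = Phi.shape
    B = Phi[:, list(subset)]
    fit = B @ (B.T @ Y / n)
    pi_m = B @ B.T / n                          # orthogonal projection onto the model
    pen = (1.0 + C + np.log(D)) * np.trace(P.T @ pi_m @ P) * sigma2 / n
    return np.sum((Y - fit) ** 2) / n + pen
```

Because the criterion splits into one term per basis vector, the subset returned by `threshold_select` coincides with the minimizer of `criterion` over all $2^{D}$ subsets, while costing only one pass over the coefficients.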

In accordance with the results of Section 3, in the case of unknown variance, we substitute $\hat\sigma^2$ for $\sigma^2$ in the penalties (32) and (33).



6.3 Numerical simulations 

We now illustrate our results and the performances of our estimation procedure by applying it to simulated data

$$Z_i = s(x_i) + \sum_{j=1}^K t^j(y_i^j) + \sigma\varepsilon_i,\quad i = 1,\dots,n ,$$

where $K\ge 1$ is an integer that will vary from one experiment to another, the design points $(x_i, y_i^1,\dots,y_i^K)'$ are known independent realizations of a uniform random variable on $[0,1]^{K+1}$ and the errors $\varepsilon_i$ are i.i.d. standard Gaussian random variables. We handle this framework with known or unknown variance factor $\sigma^2 = 1$ according to the cases and we consider a design of size $n = 512$. The unknown components $s, t^1,\dots,t^K$ are either chosen among the following ones, or set to zero in the last subsection,



fl(x) = sin (V (x A f 2 (x) = cos Uir (x - ^ j - C 2 f 3 {x) = x + 2 exp(-16a; 2 ) - Q 



/ 4 (x) = sin(2x) + 2exp(-16x 2 )-C 4 h{x) = \ \ J M ]° {x ^ f 6 (x) = 6x(l - x) 

1 + exp( — 10{x — 1/2)) 

where the constants $C_2$, $C_3$ and $C_4$ are such that $f_i\in L^2_0([0,1],dx)$ for any $i\in\{1,\dots,6\}$.

The first step of the procedure consists in computing the oblique projector $P_n$ and taking the data $Y = P_nZ$. Figure 1 gives an example by representing the signal $s$, the data $Z$ and the projected data $Y$ for $K = 6$, $s = f_1$ and $t^j = f_j$, $j\in\{1,\dots,6\}$. In particular, for this example, we have $\rho^2(P_n) = 1.22$. We see that we actually get a reasonable value of $\rho^2(P_n)$ with our particular choices for $D_n$ and $D'_n$.




Figure 1: Plot in $(x, z)$ of the signal $s$ (dashed line), the data $Z$ (dots) and the projected data $Y$ (plain line).

In order to estimate the component $s$, we choose $\hat m$ by the procedure (11) with penalty function given by (32) or (33) according to the cases. The first simulations deal with the collection $\mathcal{F}^N$ of nested models. Figure 2 represents the true $s$ and the estimator $\tilde s$ for $K = 6$ parasitic components given by $t^j = f_j$, $j\in\{1,\dots,6\}$, and $s = f_1$ or $s = f_5$. The penalty function (32) has been used with a constant $C = 1.5$.

Figure 2: Estimation of $s$ (dashed) by $\tilde s$ (plain) with $\mathcal{F}^N$, $K = 6$ and $t^j = f_j$, $j\in\{1,\dots,6\}$, for $s = f_1$ (left, $\rho(P_n) = 1.24$) and for $s = f_5$ (right, $\rho(P_n) = 1.25$).

The second set of simulations is related to the large collection $\mathcal{F}^C$ and to the penalty function (33) with $C = 4.5$. Figure 3 illustrates the estimation of $s = f_1$ and $s = f_2$ with $K = 6$ parasitic components $t^j = f_j$, $j\in\{1,\dots,6\}$.

In both cases, we see that the estimation procedure behaves well and that the norms $\rho(P_n)$ are close to one in spite of the presence of the parasitic components. Moreover, note that the collection $\mathcal{F}^C$ allows us to get estimators that are sharper because they detect constant parts of $s$. This advantage leads to a better bias term in the quadratic risk decomposition at the price of the logarithmic term in the penalty (33).



6.4 Ratio estimation 

In Section 4, we discussed assumptions that ensure a small remainder term in Inequality (29). This result corresponds to some oracle-type inequality for our estimation procedure of a component in an additive framework. We want to evaluate how far $\mathbb{E}[\|s-\tilde s\|_n^2]$ is from the oracle risk. Thus, we estimate the ratio

$$r_K(\tilde s) = \frac{\mathbb{E}\left[\|s-\tilde s\|_n^2\right]}{\inf_{m\in\mathcal{M}}\left\{\|s-s_m\|_n^2 + \dfrac{\mathrm{Tr}({}^tP_n\pi_mP_n)}{n}\sigma^2\right\}}$$

by repeating 500 times each experiment for various values of $K$ and $C$. For each set of simulations, the parasitic components are taken such that $t^j = f_j$, $j\in\{1,\dots,K\}$, the values of $\rho(P_n)$ are given and the variance $\sigma^2$ is either assumed to be known or not.

Figure 3: Estimation of $s$ (dashed) by $\tilde s$ (plain) with $\mathcal{F}^C$, $K = 6$ and $t^j = f_j$, $j\in\{1,\dots,6\}$, for $s = f_1$ (left, $\rho(P_n) = 1.23$) and for $s = f_2$ (right, $\rho(P_n) = 1.27$).

Table 1 (resp. Table 2) gives the values of $r_K(\tilde s)$ obtained for $s = f_1$ (resp. $s = f_5$) with the collection $\mathcal{F}^N$ and the penalty (32). We clearly see that taking $C$ close to zero or too large degrades the procedure. In our examples, $C = 1.5$ gives good results and we get reasonable values of $r_K(\tilde s)$ for other choices of $C$ between 1 and 3 for known or unknown variance. As expected, we also note that the values of $\rho(P_n)$ and $r_K(\tilde s)$ tend to increase when $K$ goes up but remain acceptable for $K\in\{1,\dots,6\}$.

In the same way, we estimate the ratio $r_K(\tilde s)$ for $s = f_1$ and $s = f_2$ with the collection $\mathcal{F}^C$ and the penalty (33). The results are given in Table 3 and Table 4. We obtain reasonable values of $r_K(\tilde s)$ for choices of $C$ larger than what we took in the nested case. This phenomenon is related to what we mentioned at the end of Section 2. Indeed, for a large collection of models, we need to overpenalize in order to keep the remainder term small enough. Moreover, because $\hat\sigma^2$ tends to overestimate $\sigma^2$ (see Section 3), we see that we can consider smaller values for $C$ when the variance is unknown than when it is known for obtaining equivalent results.

6.5 Parasitic components equal to zero 

We are now interested in the particular case of parasitic components $t^j$ equal to zero in (5), i.e. data are given by

$$Z_i = s(x_i) + \sigma\varepsilon_i,\quad i = 1,\dots,n .$$

    ...                        ...   ...   1.26  1.19  1.13  1.11  1.09  1.08  1.08  1.08  1.07
    K = 6, $\rho(P_n)$ = 1.27  3.14  1.77  1.29  1.21  1.17  1.13  1.12  1.10  1.10  1.09  1.09
                               1.66  1.40  1.26  1.18  1.14  1.13  1.11  1.11  1.10  1.10  1.09

Table 1: Ratio $r_K(\tilde s)$ for the estimation of $s = f_1$ with $\mathcal{F}^N$. Each pair of lines corresponds to a value of $K$ with the known $\sigma^2$ case on the first line and unknown $\sigma^2$ case on the second one.

If we know that these $K$ components are zero and if we deal with the collection $\mathcal{F}^N$ and a known variance $\sigma^2$, we can consider the classical model selection procedure given by

m E argmin \\\Z - 7T m Z\\ 2 n + g dim ^ A . (34) 

Then, we can define the estimator $\tilde s_0 = \pi_{\hat m}Z$. This procedure is well known and we refer to [27] for more details. If we do not know that the $K$ parasitic components are null, we can use our procedure to estimate $s$ by $\tilde s$. In order to compare the performances of $\tilde s$ and $\tilde s_0$ with respect to the number $K$ of zero parasitic components, we estimate the ratio

$$r_K(\tilde s,\tilde s_0) = \frac{\mathbb{E}\left[\|s-\tilde s\|_n^2\right]}{\mathbb{E}\left[\|s-\tilde s_0\|_n^2\right]}$$

for various values of $K$ and $C$ by repeating 500 times each experiment.

The obtained results are given in Tables 5 and 6 for $s = f_1$ and $s = f_5$ respectively. Obviously, the ratio $r_K(\tilde s,\tilde s_0)$ is always larger than one because the procedure (34) makes good use of its knowledge about the nullity of the $t^j$. Nevertheless, we see that our procedure performs nearly as well as (34) even for a large number of zero components. Indeed, for $K\in\{1,\dots,9\}$, not assuming that we know that the $t^j$ are zero only implies a loss between 1% and 10% for the risk. Such a loss remains acceptable in practice and allows us to consider a more general framework for estimating $s$.

7 Proofs 

In the proofs, we repeatedly use the following elementary inequality that holds for any $\alpha > 0$ and $x, y\in\mathbb{R}$,

$$2|xy| \le \alpha x^2 + \alpha^{-1}y^2 . \tag{35}$$

    C                          0.0   0.5   1.0   1.5   2.0   2.5   3.0   3.5   4.0   4.5   5.0
    K = 1, $\rho(P_n)$ = 1.28  4.08  1.52  1.22  1.20  1.27  1.35  1.45  1.56  1.64  1.70  1.79
                               3.44  1.58  1.36  1.26  1.30  1.37  1.45  1.55  1.64  1.72  1.81
    K = 2, $\rho(P_n)$ = 1.23  4.07  1.66  1.28  1.26  1.32  1.40  1.49  1.57  1.66  1.74  1.82
                               2.29  1.69  1.36  1.32  1.36  1.44  1.53  1.60  1.65  1.73  1.82
    K = 3, $\rho(P_n)$ = 1.25  4.17  1.65  1.36  1.34  1.42  1.50  1.60  1.67  1.77  1.89  2.01
                               2.24  1.70  1.41  1.41  1.48  1.55  1.61  1.71  1.80  1.92  2.01
    K = 4, $\rho(P_n)$ = 1.26  4.42  1.88  1.43  1.34  1.36  1.45  1.53  1.61  1.69  1.77  1.86
                               3.80  1.75  1.51  1.42  1.44  1.50  1.56  1.66  1.75  1.84  1.93
    K = 5, $\rho(P_n)$ = 1.26  4.57  1.82  1.43  1.37  1.39  1.46  1.53  1.60  1.67  1.76  1.83
                               2.33  1.77  1.51  1.43  1.44  1.50  1.54  1.64  1.74  1.82  1.89
    K = 6, $\rho(P_n)$ = 1.27  4.98  2.08  1.59  1.47  1.45  1.49  1.57  1.66  1.77  1.86  1.96
                               2.57  1.91  1.62  1.52  1.54  1.57  1.65  1.73  1.84  1.93  2.02

Table 2: Ratio $r_K(\tilde s)$ for the estimation of $s = f_5$ with $\mathcal{F}^N$. Each pair of lines corresponds to a value of $K$ with the known $\sigma^2$ case on the first line and unknown $\sigma^2$ case on the second one.

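7.1 Proofs of Theorems 2.1 and 2.2

7.1.1 Proof of Theorem 2.1

By definition of $\gamma_n$, for any $t\in\mathbb{R}^n$, we can write

$$\|s-t\|_n^2 = \gamma_n(t) + 2\sigma\langle t-Y, P_n\varepsilon\rangle_n + \sigma^2\|P_n\varepsilon\|_n^2 .$$

Let $m\in\mathcal{M}$, since $\hat s_m = s_m + \sigma\pi_mP_n\varepsilon$, this identity and (12) lead to

$$\begin{aligned}
\|s-\tilde s\|_n^2 &= \|s-\hat s_m\|_n^2 + \gamma_n(\tilde s) - \gamma_n(\hat s_m) + 2\sigma\langle\tilde s-\hat s_m, P_n\varepsilon\rangle_n\\
&= \|s-s_m\|_n^2 + \gamma_n(\tilde s) - \gamma_n(\hat s_m) - \sigma^2\|\pi_mP_n\varepsilon\|_n^2\\
&\quad - 2\sigma\langle s-\tilde s, P_n\varepsilon\rangle_n + 2\sigma\langle s-s_m, P_n\varepsilon\rangle_n\\
&\le \|s-s_m\|_n^2 + \mathrm{pen}(m) - \mathrm{pen}(\hat m) + 2\sigma^2\|\pi_{\hat m}P_n\varepsilon\|_n^2\\
&\quad - 2\sigma\langle s-s_{\hat m}, P_n\varepsilon\rangle_n + 2\sigma\langle s-s_m, P_n\varepsilon\rangle_n - \sigma^2\|\pi_mP_n\varepsilon\|_n^2 .
\end{aligned}\tag{36}$$

Consider an arbitrary vector $a_m$ such that $\|a_m\|_n = 1$, we define

$$u_m = \begin{cases}(s-s_m)/\|s-s_m\|_n & \text{if } s\ne\pi_ms,\\ a_m & \text{otherwise.}\end{cases}\tag{37}$$

Thus, (36) gives

$$\begin{aligned}
\|s-\tilde s\|_n^2 &\le \|s-s_m\|_n^2 + \mathrm{pen}(m) - \mathrm{pen}(\hat m) + 2\sigma^2\|\pi_{\hat m}P_n\varepsilon\|_n^2\\
&\quad + 2\sigma\|s-s_{\hat m}\|_n\left|\langle u_{\hat m}, P_n\varepsilon\rangle_n\right| + 2\sigma\langle s-s_m, P_n\varepsilon\rangle_n - \sigma^2\|\pi_mP_n\varepsilon\|_n^2 .
\end{aligned}\tag{38}$$

Take $\alpha\in(0,1)$, to be specified later, and use Inequality (35) together with $\|s-s_{\hat m}\|_n^2 = \|s-\tilde s\|_n^2 - \sigma^2\|\pi_{\hat m}P_n\varepsilon\|_n^2$ to get

$$\begin{aligned}
(1-\alpha)\|s-\tilde s\|_n^2 &\le \|s-s_m\|_n^2 + \mathrm{pen}(m) - \mathrm{pen}(\hat m)\\
&\quad + (2-\alpha)\sigma^2\|\pi_{\hat m}P_n\varepsilon\|_n^2 + \alpha^{-1}\sigma^2\langle u_{\hat m}, P_n\varepsilon\rangle_n^2\\
&\quad + 2\sigma\langle s-s_m, P_n\varepsilon\rangle_n - \sigma^2\|\pi_mP_n\varepsilon\|_n^2 .
\end{aligned}\tag{39}$$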




    C                          0.0   0.5   1.0   1.5   2.0   2.5   3.0   3.5   4.0   4.5   5.0
    K = 1, $\rho(P_n)$ = 1.27  1.54  1.49  1.44  1.40  1.36  1.33  1.31  1.30  1.28  1.27  1.25
                               1.50  1.44  1.39  1.35  1.32  1.30  1.28  1.26  1.25  1.24  1.23
    K = 2, $\rho(P_n)$ = 1.25  1.60  1.53  1.48  1.45  1.40  1.37  1.34  1.32  1.29  1.28  1.26
                               1.54  1.48  1.42  1.38  1.35  1.32  1.29  1.28  1.27  1.25  1.24
    K = 3, $\rho(P_n)$ = 1.25  1.56  1.50  1.46  1.42  1.38  1.35  1.32  1.30  1.28  1.27  1.26
                               1.51  1.45  1.41  1.37  1.34  1.31  1.29  1.27  1.25  1.24  1.23
    K = 4, $\rho(P_n)$ = 1.25  1.61  1.54  1.48  1.42  1.39  1.36  1.34  1.31  1.29  1.28  1.27
                               1.51  1.44  1.40  1.36  1.32  1.31  1.28  1.27  1.26  1.25  1.24
    K = 5, $\rho(P_n)$ = 1.25  1.68  1.61  1.54  1.48  1.44  1.41  1.37  1.34  1.32  1.30  1.28
                               1.56  1.49  1.43  1.39  1.36  1.31  1.29  1.27  1.27  1.26  1.25
    K = 6, $\rho(P_n)$ = 1.24  1.78  1.70  1.63  1.57  1.53  1.48  1.44  1.42  1.39  1.35  1.34
                               1.61  1.55  1.48  1.44  1.40  1.37  1.34  1.32  1.30  1.28  1.28

Table 3: Ratio $r_K(\tilde s)$ for the estimation of $s = f_1$ with $\mathcal{F}^C$. Each pair of lines corresponds to a value of $K$ with the known $\sigma^2$ case on the first line and unknown $\sigma^2$ case on the second one.

We choose $\alpha = 1/(1+\theta)\in(0,1)$ but for legibility we keep using the notation $\alpha$. Let us now introduce two functions $p_1, p_2 : \mathcal{M}\to\mathbb{R}_+$ that will be specified later to satisfy, for all $m\in\mathcal{M}$,

$$\mathrm{pen}(m) \ge (2-\alpha)p_1(m) + \alpha^{-1}p_2(m) . \tag{40}$$

We use this bound in (39) to obtain

$$\begin{aligned}
(1-\alpha)\|s-\tilde s\|_n^2 &\le \|s-s_m\|_n^2 + \mathrm{pen}(m) + (2-\alpha)\left(\sigma^2\|\pi_{\hat m}P_n\varepsilon\|_n^2 - p_1(\hat m)\right)\\
&\quad + \alpha^{-1}\left(\sigma^2\langle u_{\hat m}, P_n\varepsilon\rangle_n^2 - p_2(\hat m)\right) + 2\sigma\langle s-s_m, P_n\varepsilon\rangle_n - \sigma^2\|\pi_mP_n\varepsilon\|_n^2\\
&\le \|s-s_m\|_n^2 + \mathrm{pen}(m) + 2\sigma\langle s-s_m, P_n\varepsilon\rangle_n - \sigma^2\|\pi_mP_n\varepsilon\|_n^2\\
&\quad + (2-\alpha)\sup_{m'\in\mathcal{M}}\left(\sigma^2\|\pi_{m'}P_n\varepsilon\|_n^2 - p_1(m')\right)_+\\
&\quad + \alpha^{-1}\sup_{m'\in\mathcal{M}}\left(\sigma^2\langle u_{m'}, P_n\varepsilon\rangle_n^2 - p_2(m')\right)_+ .
\end{aligned}$$

    C                          0.0   0.5   1.0   1.5   2.0   2.5   3.0   3.5   4.0   4.5   5.0
    K = 1, $\rho(P_n)$ = 1.28  2.01  1.92  1.86  1.80  1.76  1.74  1.70  1.70  1.68  1.67  1.68
                               2.03  1.93  1.87  1.81  1.77  1.72  1.68  1.65  1.65  1.66  1.67
    K = 2, $\rho(P_n)$ = 1.22  2.02  1.93  1.85  1.79  1.75  1.71  1.68  1.66  1.66  1.66  1.66
                               1.95  1.88  1.82  1.78  1.75  1.71  1.68  1.67  1.65  1.64  1.64
    K = 3, $\rho(P_n)$ = 1.26  2.04  1.93  1.86  1.81  1.76  1.71  1.68  1.64  1.62  1.62  1.62
                               1.96  1.87  1.80  1.74  1.68  1.66  1.63  1.63  1.61  1.62  1.62
    K = 4, $\rho(P_n)$ = 1.25  2.12  2.00  1.90  1.81  1.73  1.67  1.64  1.62  1.60  1.61  1.60
                               1.99  1.90  1.80  1.73  1.68  1.65  1.62  1.60  1.60  1.60  1.60
    K = 5, $\rho(P_n)$ = 1.24  2.47  2.34  2.23  2.17  2.10  2.05  1.99  1.95  1.91  1.88  1.86
                               2.30  2.20  2.11  2.03  1.97  1.92  1.88  1.83  1.82  1.80  1.80
    K = 6, $\rho(P_n)$ = 1.26  2.45  2.32  2.21  2.11  2.03  1.99  1.95  1.91  1.89  1.86  1.84
                               2.17  2.06  1.99  1.94  1.89  1.85  1.84  1.80  1.79  1.79  1.75

Table 4: Ratio $r_K(\tilde s)$ for the estimation of $s = f_2$ with $\mathcal{F}^C$. Each pair of lines corresponds to a value of $K$ with the known $\sigma^2$ case on the first line and unknown $\sigma^2$ case on the second one.



    C       0.0   0.5   1.0   1.5   2.0   2.5   3.0   3.5   4.0   4.5   5.0
    K = 1   1.11  1.11  1.09  1.06  1.04  1.03  1.03  1.02  1.01  1.02  1.02
    K = 2   1.12  1.08  1.08  1.06  1.04  1.03  1.02  1.01  1.01  1.01  1.01
    K = 3   1.13  1.09  1.07  1.07  1.05  1.03  1.01  1.01  1.02  1.02  1.02
    K = 4   1.08  1.08  1.06  1.05  1.04  1.02  1.02  1.01  1.01  1.01  1.01
    K = 5   1.10  1.05  1.06  1.06  1.03  1.02  1.02  1.01  1.01  1.01  1.01
    K = 6   1.08  1.07  1.06  1.05  1.03  1.02  1.01  1.01  1.01  1.01  1.01
    K = 7   1.11  1.09  1.08  1.05  1.03  1.02  1.01  1.01  1.01  1.01  1.01
    K = 8   1.09  1.06  1.08  1.05  1.04  1.02  1.01  1.01  1.01  1.01  1.01
    K = 9   1.10  1.08  1.07  1.05  1.03  1.02  1.01  1.01  1.01  1.01  1.01

Table 5: Ratio $r_K(\tilde s,\tilde s_0)$ for the estimation of $s = f_1$ with $\mathcal{F}^N$.



Taking the expectation on both sides, it leads to

$$\begin{aligned}
(1-\alpha)\mathbb{E}\left[\|s-\tilde s\|_n^2\right] &\le \|s-s_m\|_n^2 + \mathrm{pen}(m) - \mathrm{Tr}({}^tP_n\pi_mP_n)\sigma^2/n\\
&\quad + (2-\alpha)\mathbb{E}\left[\sup_{m'\in\mathcal{M}}\left(\sigma^2\|\pi_{m'}P_n\varepsilon\|_n^2 - p_1(m')\right)_+\right]\\
&\quad + \alpha^{-1}\mathbb{E}\left[\sup_{m'\in\mathcal{M}}\left(\sigma^2\langle u_{m'}, P_n\varepsilon\rangle_n^2 - p_2(m')\right)_+\right]\\
&\le \|s-s_m\|_n^2 + \mathrm{pen}(m) - \mathrm{Tr}({}^tP_n\pi_mP_n)\sigma^2/n\\
&\quad + (2-\alpha)\sum_{m'\in\mathcal{M}}\mathbb{E}\left[\left(\sigma^2\|\pi_{m'}P_n\varepsilon\|_n^2 - p_1(m')\right)_+\right]\\
&\quad + \alpha^{-1}\sum_{m'\in\mathcal{M}}\mathbb{E}\left[\left(\sigma^2\langle u_{m'}, P_n\varepsilon\rangle_n^2 - p_2(m')\right)_+\right]\\
&\le \|s-s_m\|_n^2 + \mathrm{pen}(m) - \mathrm{Tr}({}^tP_n\pi_mP_n)\sigma^2/n\\
&\quad + (2-\alpha)\sum_{m'\in\mathcal{M}}E_{1,m'} + \alpha^{-1}\sum_{m'\in\mathcal{M}}E_{2,m'} .
\end{aligned}$$

    C       0.0   0.5   1.0   1.5   2.0   2.5   3.0   3.5   4.0   4.5   5.0
    K = 1   1.08  1.09  1.07  1.07  1.09  1.09  1.08  1.07  1.06  1.09  1.07
    K = 2   1.09  1.05  1.08  1.09  1.09  1.08  1.08  1.08  1.06  1.06  1.05
    K = 3   1.12  1.12  1.11  1.07  1.09  1.10  1.09  1.08  1.07  1.06  1.07
    K = 4   1.09  1.11  1.08  1.10  1.10  1.09  1.07  1.06  1.07  1.06  1.07
    K = 5   1.10  1.08  1.09  1.09  1.09  1.06  1.06  1.06  1.07  1.07  1.05
    K = 6   1.08  1.04  1.06  1.07  1.08  1.07  1.07  1.09  1.06  1.06  1.06
    K = 7   1.06  1.05  1.07  1.08  1.10  1.09  1.07  1.09  1.08  1.07  1.06
    K = 8   1.08  1.13  1.08  1.09  1.09  1.08  1.06  1.07  1.07  1.06  1.06
    K = 9   1.13  1.05  1.09  1.09  1.07  1.07  1.07  1.06  1.07  1.07  1.06

Table 6: Ratio $r_K(\tilde s,\tilde s_0)$ for the estimation of $s = f_5$ with $\mathcal{F}^N$.

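Because the choice of $m$ is arbitrary among $\mathcal{M}$, we can infer that

$$\begin{aligned}
(1-\alpha)\mathbb{E}\left[\|s-\tilde s\|_n^2\right] &\le \inf_{m\in\mathcal{M}}\left\{\|s-s_m\|_n^2 + \mathrm{pen}(m) - \mathrm{Tr}({}^tP_n\pi_mP_n)\sigma^2/n\right\}\\
&\quad + (2-\alpha)\sum_{m\in\mathcal{M}}E_{1,m} + \alpha^{-1}\sum_{m\in\mathcal{M}}E_{2,m} .
\end{aligned}\tag{41}$$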



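We now have to upper bound $E_{1,m}$ and $E_{2,m}$ in (41). Let us start with the first one. If $S_m = \{0\}$, then $\pi_mP_n = 0$ and $p_1(m)\ge 0$ suffices to ensure that $E_{1,m} = 0$. So, we can consider that the dimension of $S_m$ is positive and $\pi_mP_n\ne 0$. Lemma 8.2 applied with $A = \pi_mP_n$ gives, for any $x > 0$,

$$\mathbb{P}\left(n\|\pi_mP_n\varepsilon\|_n^2 \ge \mathrm{Tr}({}^tP_n\pi_mP_n) + 2\sqrt{\rho^2(P_n)\mathrm{Tr}({}^tP_n\pi_mP_n)x} + 2\rho^2(P_n)x\right) \le e^{-x} \tag{42}$$

because $\rho(\pi_mP_n)\le\rho(\pi_m)\rho(P_n)\le\rho(P_n)$. Let $\beta = \theta^2/(1+2\theta) > 0$. Inequalities (35) and (42) lead to

$$\mathbb{P}\left(n\|\pi_mP_n\varepsilon\|_n^2 \ge (1+\beta)\mathrm{Tr}({}^tP_n\pi_mP_n) + (2+\beta^{-1})\rho^2(P_n)x\right) \le e^{-x} . \tag{43}$$

Let $\delta = \theta^2/((1+\theta)(1+2\theta+2\theta^2)) > 0$, we set

$$np_1(m) = \left((1+\beta) + (2+\beta^{-1})\delta L_m\right)\mathrm{Tr}({}^tP_n\pi_mP_n)\sigma^2$$

and (43) implies

$$\begin{aligned}
E_{1,m} &= \mathbb{E}\left[\left(\sigma^2\|\pi_mP_n\varepsilon\|_n^2 - p_1(m)\right)_+\right] = \int_0^\infty\mathbb{P}\left(\left(\sigma^2\|\pi_mP_n\varepsilon\|_n^2 - p_1(m)\right)_+ \ge \xi\right)d\xi\\
&\le \int_0^\infty\mathbb{P}\left(n\|\pi_mP_n\varepsilon\|_n^2 - np_1(m)/\sigma^2 \ge n\xi/\sigma^2\right)d\xi\\
&\le \int_0^\infty\exp\left(-\frac{n\xi}{(2+\beta^{-1})\rho^2(P_n)\sigma^2} - \frac{\delta L_m\mathrm{Tr}({}^tP_n\pi_mP_n)}{\rho^2(P_n)}\right)d\xi\\
&\le \frac{(2+\beta^{-1})\rho^2(P_n)\sigma^2}{n}\exp\left(-\frac{\delta L_m\mathrm{Tr}({}^tP_n\pi_mP_n)}{\rho^2(P_n)}\right) .
\end{aligned}\tag{44}$$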
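For readers who want the intermediate steps, the following worked computation sketches how the bounds (43) and (44) can be obtained from (42). It only uses Inequality (35), the definition of $p_1(m)$ given above and the identity $\mathbb{E}[X_+] = \int_0^\infty\mathbb{P}(X_+\ge\xi)\,d\xi$. The shorthand $T_m = \mathrm{Tr}({}^tP_n\pi_mP_n)$ and the function $x(\xi)$ are introduced here for brevity and are not notation from the text.

```latex
% Step 1: Inequality (35) with parameter \beta, applied to \sqrt{T_m} and \sqrt{\rho^2(P_n) x}
2\sqrt{\rho^2(P_n)\,T_m\,x} \;\le\; \beta\,T_m + \beta^{-1}\rho^2(P_n)\,x ,
% so the threshold of (42) is at most (1+\beta)T_m + (2+\beta^{-1})\rho^2(P_n)x, which gives (43).

% Step 2: choose x = x(\xi) so that the threshold of (43) equals n p_1(m)/\sigma^2 + n\xi/\sigma^2,
% with n p_1(m)/\sigma^2 = (1+\beta)T_m + (2+\beta^{-1})\delta L_m T_m :
x(\xi) \;=\; \frac{n\xi}{(2+\beta^{-1})\rho^2(P_n)\sigma^2} + \frac{\delta L_m T_m}{\rho^2(P_n)} .

% Step 3: integrate the tail bound e^{-x(\xi)} over \xi \in (0,\infty)
\int_0^\infty e^{-x(\xi)}\,d\xi
  \;=\; \frac{(2+\beta^{-1})\rho^2(P_n)\sigma^2}{n}
        \exp\!\left(-\frac{\delta L_m T_m}{\rho^2(P_n)}\right) .
```

The same three steps, with the constant $2$ in place of $2+\beta^{-1}$, apply to the Gaussian tail bound used for the second family of terms.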



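We now focus on $E_{2,m}$. The random variable $\langle u_m, P_n\varepsilon\rangle_n = \langle{}^tP_nu_m, \varepsilon\rangle_n$ is a centered Gaussian variable with variance $\|{}^tP_nu_m\|_n^2/n$. For any $x > 0$, the standard Gaussian deviation inequality gives

$$\mathbb{P}\left(|\langle u_m, P_n\varepsilon\rangle_n| \ge x\right) \le \exp\left(-\frac{nx^2}{2\|{}^tP_nu_m\|_n^2}\right) \le \exp\left(-\frac{nx^2}{2\rho^2(P_n)}\right)$$

that is equivalent to

$$\mathbb{P}\left(n\langle u_m, P_n\varepsilon\rangle_n^2 \ge 2\rho^2(P_n)x\right) \le e^{-x} . \tag{45}$$

We set

$$np_2(m) = 2\delta L_m\mathrm{Tr}({}^tP_n\pi_mP_n)\sigma^2$$

and (45) leads to

$$\begin{aligned}
E_{2,m} &= \mathbb{E}\left[\left(\sigma^2\langle u_m, P_n\varepsilon\rangle_n^2 - p_2(m)\right)_+\right] = \int_0^\infty\mathbb{P}\left(\left(\sigma^2\langle u_m, P_n\varepsilon\rangle_n^2 - p_2(m)\right)_+ \ge \xi\right)d\xi\\
&\le \int_0^\infty\mathbb{P}\left(n\langle u_m, P_n\varepsilon\rangle_n^2 - np_2(m)/\sigma^2 \ge n\xi/\sigma^2\right)d\xi\\
&\le \int_0^\infty\exp\left(-\frac{n\xi}{2\rho^2(P_n)\sigma^2} - \frac{\delta L_m\mathrm{Tr}({}^tP_n\pi_mP_n)}{\rho^2(P_n)}\right)d\xi\\
&\le \frac{2\rho^2(P_n)\sigma^2}{n}\exp\left(-\frac{\delta L_m\mathrm{Tr}({}^tP_n\pi_mP_n)}{\rho^2(P_n)}\right) .
\end{aligned}\tag{46}$$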
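We inject (44) and (46) in (41) and we replace $\alpha$ and $\beta$ by their values to obtain

$$\frac{\theta}{1+\theta}\mathbb{E}\left[\|s-\tilde s\|_n^2\right] \le \inf_{m\in\mathcal{M}}\left\{\|s-s_m\|_n^2 + \mathrm{pen}(m) - \mathrm{Tr}({}^tP_n\pi_mP_n)\sigma^2/n\right\} + \frac{\rho^2(P_n)\sigma^2}{n}R_\theta$$

where we have set, with $\delta$ as above,

$$R_\theta = c_\theta\sum_{m\in\mathcal{M}}\exp\left(-\frac{\delta L_m\mathrm{Tr}({}^tP_n\pi_mP_n)}{\rho^2(P_n)}\right)$$

and

$$c_\theta = \frac{2\theta^4 + 8\theta^3 + 8\theta^2 + 4\theta + 1}{\theta^2(1+\theta)} .$$

Finally, (40) gives a penalty as (15) and the announced result follows.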
7.1.2 Proof of Theorem 2.2

In order to prove Theorem 2.2, we show the following stronger result. Under the assumptions of the theorem, there exists a positive constant $C$ that only depends on $p$ and $\theta$, such that, for any $z > 0$,



+ 2 n 



N (lAz-P/ 2 )+Rp n , p (T,z) (47) 



where the quantity % is defined by 



,2 - + 4 ■ f f,| ,,2 , 2 (^ + 2 ) 



W = II* - ~s\\n ~ — |ll«-»m|ln + -0^P en M 






and we have set Rp nP (J-, z) equal to 

v / TiCP n 7r m P n ) \ f L m TvCP n 7T m P n ) y p/2 
^ V p^Pn-KmPn) ) \ P 2 {Pn) J 

Thus, for any $q > 0$ such that $2(q+1) < p$, we integrate (47) via Lemma 8.1 to get



/•CO 

E [?4] = / c?t 9-1 P (H+ ^ t) (it 

-J J qz q ~ l ¥ 



On J Jo \9 + 2 ^ n 

^ C\p,q,9)T p (^^y R ™ 9 {F) (48) 



where we have set 



p - e( ) = ° + m£ JL,., v 1 + J I ) ■ 

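Since

\[
\mathbb{E}\left[\|s-\tilde s\|_n^{2q}\right]^{1/q} \leqslant \mathbb{E}\left[ \left( \frac{\theta+4}{\theta}\inf_{m\in\mathcal{M}}\left\{ \|s-s_m\|_n^2 + \frac{2(\theta+2)}{\theta+4}\operatorname{pen}(m) \right\} + H_+ \right)^q \right]^{1/q} ,
\]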



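it follows from Minkowski's Inequality when $q \geqslant 1$ or convexity arguments when $0 < q < 1$ that

\[
\mathbb{E}\left[\|s-\tilde s\|_n^{2q}\right]^{1/q} \leqslant 2^{(q^{-1}-1)_+}\left( C''(\theta)\inf_{m\in\mathcal{M}}\left\{ \|s-s_m\|_n^2 + \operatorname{pen}(m) \right\} + \mathbb{E}\left[H_+^q\right]^{1/q} \right) . \tag{49}
\]

The inequality of Theorem 2.2 directly follows from (48) and (49).

We now turn to the proof of (47). Inequality (39) does not depend on the distribution of $\varepsilon$ and we start from here. Let $\alpha = \alpha(\theta) \in (0,1)$, for any $m \in \mathcal{M}$ we have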

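\[
\begin{aligned}
(1-\alpha)\|s-\tilde s\|_n^2 \leqslant{}& \|s-s_m\|_n^2 + \operatorname{pen}(m) - \operatorname{pen}(\hat m) + (2-\alpha)\sigma^2\|\pi_{\hat m}P_n\varepsilon\|_n^2 \\
& + \alpha^{-1}\sigma^2\langle u_{\hat m}, P_n\varepsilon\rangle_n^2 + 2\sigma\langle s-s_m, P_n\varepsilon\rangle_n
\end{aligned}
\]

where $u_m$ is defined by (37). Use again (35) with $\alpha$ to obtain

\[
\begin{aligned}
(1-\alpha)\|s-\tilde s\|_n^2 \leqslant{}& \|s-s_m\|_n^2 + \operatorname{pen}(m) - \operatorname{pen}(\hat m) + (2-\alpha)\sigma^2\|\pi_{\hat m}P_n\varepsilon\|_n^2 \\
& + \alpha^{-1}\sigma^2\langle u_{\hat m}, P_n\varepsilon\rangle_n^2 + 2\sigma\|s-s_m\|_n\,|\langle u_m, P_n\varepsilon\rangle_n| \\
\leqslant{}& (1+\alpha)\|s-s_m\|_n^2 + \operatorname{pen}(m) - \operatorname{pen}(\hat m) \\
& + (2-\alpha)\sigma^2\|\pi_{\hat m}P_n\varepsilon\|_n^2 \\
& + \alpha^{-1}\sigma^2\langle u_{\hat m}, P_n\varepsilon\rangle_n^2 + \alpha^{-1}\sigma^2\langle u_m, P_n\varepsilon\rangle_n^2 .
\end{aligned} \tag{50}
\]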

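Let us now introduce two functions $p_1, p_2 : \mathcal{M} \to \mathbb{R}_+$ that will be specified later and that satisfy,

\[
\forall m \in \mathcal{M},\quad \operatorname{pen}(m) \geqslant (2-\alpha)p_1(m) + \alpha^{-1}p_2(m) . \tag{51}
\]

Thus, Inequality (50) implies

\[
\begin{aligned}
(1-\alpha)\|s-\tilde s\|_n^2 \leqslant{}& (1+\alpha)\|s-s_m\|_n^2 + \operatorname{pen}(m) + \alpha^{-1}p_2(m) \\
& + (2-\alpha)\left( \sigma^2\|\pi_{\hat m}P_n\varepsilon\|_n^2 - p_1(\hat m) \right) \\
& + \alpha^{-1}\left( \sigma^2\langle u_{\hat m}, P_n\varepsilon\rangle_n^2 - p_2(\hat m) \right) \\
& + \alpha^{-1}\left( \sigma^2\langle u_m, P_n\varepsilon\rangle_n^2 - p_2(m) \right) \\
\leqslant{}& (1+\alpha)\left( \|s-s_m\|_n^2 + 2\operatorname{pen}(m)/(1+\alpha) \right) \\
& + (2-\alpha)\sup_{m'\in\mathcal{M}}\left( \sigma^2\|\pi_{m'}P_n\varepsilon\|_n^2 - p_1(m') \right)_+ \\
& + 2\alpha^{-1}\sup_{m'\in\mathcal{M}}\left( \sigma^2\langle u_{m'}, P_n\varepsilon\rangle_n^2 - p_2(m') \right)_+ .
\end{aligned}
\]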

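Because the choice of $m$ is arbitrary among $\mathcal{M}$, we can infer that, for any $\xi > 0$,

\[
\begin{aligned}
\mathbb{P}\left( (1-\alpha)H_+ \geqslant \xi \right) \leqslant{}& \mathbb{P}\left( (2-\alpha)\sup_{m\in\mathcal{M}}\left( \sigma^2\|\pi_m P_n\varepsilon\|_n^2 - p_1(m) \right)_+ \geqslant \frac{\xi}{2} \right) \\
& + \mathbb{P}\left( 2\alpha^{-1}\sup_{m\in\mathcal{M}}\left( \sigma^2\langle u_m, P_n\varepsilon\rangle_n^2 - p_2(m) \right)_+ \geqslant \frac{\xi}{2} \right) \\
\leqslant{}& \sum_{m\in\mathcal{M}}\mathbb{P}\left( \sigma^2\|\pi_m P_n\varepsilon\|_n^2 \geqslant p_1(m) + \frac{\xi}{2(2-\alpha)} \right) \\
& + \sum_{m\in\mathcal{M}}\mathbb{P}\left( \sigma^2\langle u_m, P_n\varepsilon\rangle_n^2 \geqslant p_2(m) + \frac{\alpha\xi}{4} \right) \\
={}& \sum_{m\in\mathcal{M}}P_{1,m}(\xi) + \sum_{m\in\mathcal{M}}P_{2,m}(\xi) .
\end{aligned} \tag{52}
\]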



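We first bound $P_{1,m}(\xi)$. For $m \in \mathcal{M}$ such that $S_m = \{0\}$ (i.e. $\pi_m = 0$), $p_1(m) \geqslant 0$ leads obviously to $P_{1,m}(\xi) = 0$. Thus, it is sufficient to bound $P_{1,m}(\xi)$ for $m$ such that $\pi_m$ is not null. This ensures that the symmetric nonnegative matrix $A = {}^t\!P_n\pi_m P_n$ lies in $\mathbb{M}_n \setminus \{0\}$. Thus, under hypothesis $(\mathrm{H}_{\mathrm{Mom}})$, Corollary 5.1 of [1] gives us, for any $x_m > 0$,

\[
\mathbb{P}\left( n\|\pi_m P_n\varepsilon\|_n^2 \geqslant \operatorname{Tr}(A) + 2\sqrt{\rho(A)\operatorname{Tr}(A)x_m} + \rho(A)x_m \right) \leqslant \frac{C_1(p)\tau_p\operatorname{Tr}(A)}{\rho(A)x_m^{p/2}}
\]

where $C_1(p)$ is a constant that only depends on $p$. The properties of the norm $\rho$ imply

\[
\rho(A) = \rho\left( {}^t(\pi_m P_n)(\pi_m P_n) \right) = \rho(\pi_m P_n)^2 \leqslant \rho^2(P_n) . \tag{53}
\]

By the inequalities (53) and (35) with $\theta/2 > 0$, we obtain

\[
\mathbb{P}\left( n\|\pi_m P_n\varepsilon\|_n^2 \geqslant \left(1+\frac{\theta}{2}\right)\operatorname{Tr}(A) + \left(1+\frac{2}{\theta}\right)\rho^2(P_n)x_m \right) \leqslant \frac{C_1(p)\tau_p\operatorname{Tr}(A)}{\rho(A)x_m^{p/2}} . \tag{54}
\]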

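We take $\alpha = 2/(\theta+2) \in (0,1)$ but for legibility we keep using the notation $\alpha$. Moreover, we choose

\[
n p_1(m) = \left( 1 + \frac{\theta}{2} + \frac{L_m}{2(\theta+1)} \right)\operatorname{Tr}({}^t\!P_n\pi_m P_n)\sigma^2
\]

and

\[
x_m = \frac{\theta}{2(\theta+1)(\theta+2)} \times \frac{L_m\operatorname{Tr}({}^t\!P_n\pi_m P_n) + n\xi/\sigma^2}{\rho^2(P_n)} .
\]

Thus, Inequality (54) leads to

\[
\begin{aligned}
P_{1,m}(\xi) &= \mathbb{P}\left( \sigma^2\|\pi_m P_n\varepsilon\|_n^2 \geqslant p_1(m) + \frac{\xi}{2(2-\alpha)} \right) \\
&= \mathbb{P}\left( \sigma^2\|\pi_m P_n\varepsilon\|_n^2 \geqslant p_1(m) + \frac{(\theta+2)\xi}{4(\theta+1)} \right) \\
&\leqslant \mathbb{P}\left( n\|\pi_m P_n\varepsilon\|_n^2 \geqslant \left(1+\frac{\theta}{2}\right)\operatorname{Tr}({}^t\!P_n\pi_m P_n) + \left(1+\frac{2}{\theta}\right)\rho^2(P_n)x_m \right) \\
&\leqslant C_2(p,\theta)\frac{\operatorname{Tr}({}^t\!P_n\pi_m P_n)\tau_p}{\rho({}^t\!P_n\pi_m P_n)}\left( \frac{L_m\operatorname{Tr}({}^t\!P_n\pi_m P_n) + n\xi/\sigma^2}{\rho^2(P_n)} \right)^{-p/2} .
\end{aligned} \tag{55}
\]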

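We now focus on $P_{2,m}(\xi)$. Let $y_m$ be some positive real number, the Markov Inequality leads to

\[
\mathbb{P}\left( |\langle u_m, P_n\varepsilon\rangle_n| \geqslant y_m \right) \leqslant y_m^{-p}\,\mathbb{E}\left[ |\langle u_m, P_n\varepsilon\rangle_n|^p \right] = y_m^{-p}\,\mathbb{E}\left[ |\langle {}^t\!P_n u_m, \varepsilon\rangle_n|^p \right] . \tag{56}
\]

Since $p > 2$, the quantity $\tau_p$ is lower bounded by 1,

\[
\tau_p = \mathbb{E}\left[|\varepsilon_1|^p\right] \geqslant \mathbb{E}\left[\varepsilon_1^2\right]^{p/2} = 1 . \tag{57}
\]

Moreover, we can apply the Rosenthal inequality (see Chapter 2 of [31]) to obtain

\[
\mathbb{E}\left[ |\langle {}^t\!P_n u_m, \varepsilon\rangle_n|^p \right] \leqslant C_3(p)n^{-p}\left( \tau_p\sum_{i=1}^n |({}^t\!P_n u_m)_i|^p + n^{p/2}\|{}^t\!P_n u_m\|_n^p \right) \tag{58}
\]

where $C_3(p)$ is a constant that only depends on $p$. Since $p > 2$, we have

\[
\sum_{i=1}^n |({}^t\!P_n u_m)_i|^p \leqslant \left( \sum_{i=1}^n ({}^t\!P_n u_m)_i^2 \right)^{p/2} = n^{p/2}\|{}^t\!P_n u_m\|_n^p \leqslant n^{p/2}\rho^p(P_n) .
\]

Thus, the Inequality (58) becomes

\[
\mathbb{E}\left[ |\langle {}^t\!P_n u_m, \varepsilon\rangle_n|^p \right] \leqslant 2C_3(p)\rho^p(P_n)\tau_p n^{-p/2}
\]

and, putting this inequality in (56), we obtain

\[
\mathbb{P}\left( |\langle u_m, P_n\varepsilon\rangle_n| \geqslant y_m \right) \leqslant 2C_3(p)\rho^p(P_n)\tau_p n^{-p/2}y_m^{-p} . \tag{59}
\]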

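We take

\[
n p_2(m) = \frac{1}{2(\theta+1)}\sigma^2 L_m\operatorname{Tr}({}^t\!P_n\pi_m P_n)
\]

and

\[
y_m^2 = \frac{L_m\operatorname{Tr}({}^t\!P_n\pi_m P_n) + n\xi/\sigma^2}{2(\theta+2)n} .
\]

Finally, (59) gives

\[
\begin{aligned}
P_{2,m}(\xi) &= \mathbb{P}\left( \sigma^2\langle u_m, P_n\varepsilon\rangle_n^2 \geqslant p_2(m) + \frac{\alpha\xi}{4} \right) \\
&= \mathbb{P}\left( \sigma^2\langle u_m, P_n\varepsilon\rangle_n^2 \geqslant p_2(m) + \frac{\xi}{2(\theta+2)} \right) \\
&\leqslant \mathbb{P}\left( \langle u_m, P_n\varepsilon\rangle_n^2 \geqslant y_m^2 \right) \\
&\leqslant C_4(p,\theta)\tau_p\left( \frac{L_m\operatorname{Tr}({}^t\!P_n\pi_m P_n) + n\xi/\sigma^2}{\rho^2(P_n)} \right)^{-p/2} .
\end{aligned} \tag{60}
\]

Taking

\[
R'(\xi) = \sum_{m\in\mathcal{M}:\,S_m\neq\{0\}} \left( 1 + \frac{\operatorname{Tr}({}^t\!P_n\pi_m P_n)}{\rho({}^t\!P_n\pi_m P_n)} \right)\left( \frac{L_m\operatorname{Tr}({}^t\!P_n\pi_m P_n) + n\xi/\sigma^2}{\rho^2(P_n)} \right)^{-p/2}
\]

and putting together Inequalities (52), (55) and (60) lead us to

\[
\begin{aligned}
\mathbb{P}\left( (1-\alpha)H_+ \geqslant \xi \right) &\leqslant \sum_{m\in\mathcal{M}:\,S_m=\{0\}} P_{2,m}(\xi) + \sum_{m\in\mathcal{M}:\,S_m\neq\{0\}} \left( P_{1,m}(\xi) + P_{2,m}(\xi) \right) \\
&\leqslant N\left(1\vee C_4(p,\theta)\right)\tau_p\left( 1\wedge\left( \frac{\rho^2(P_n)\sigma^2}{n\xi} \right)^{p/2} \right) + C_5(p,\theta)\tau_p R'(\xi) .
\end{aligned}
\]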

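For $z > 0$, take $\xi = \rho^2(P_n)\sigma^2 z/n$ to obtain (47). We conclude the proof by computing the lower bound (51) on the penalty function,

\[
\begin{aligned}
(2-\alpha)p_1(m) + \alpha^{-1}p_2(m) &= \frac{2(\theta+1)}{\theta+2}p_1(m) + \frac{\theta+2}{2}p_2(m) \\
&= \left( 1 + \theta + \frac{\theta^2+8\theta+8}{4(\theta+1)(\theta+2)}L_m \right)\frac{\operatorname{Tr}({}^t\!P_n\pi_m P_n)}{n}\sigma^2 .
\end{aligned}
\]

Since $(\theta^2+8\theta+8)/(4(\theta+1)(\theta+2)) < 1$, the penalty considered in Theorem 2.2 satisfies the condition (51).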
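The last computation can be double-checked with exact rational arithmetic. The Python sketch below assumes that the choices made in this proof read $\alpha = 2/(\theta+2)$, $n p_1(m) = (1 + \theta/2 + L_m/(2(\theta+1)))\operatorname{Tr}({}^t\!P_n\pi_m P_n)\sigma^2$ and $n p_2(m) = L_m\operatorname{Tr}({}^t\!P_n\pi_m P_n)\sigma^2/(2(\theta+1))$; the function names are hypothetical and only serve the illustration.

```python
from fractions import Fraction


def penalty_coefficients(theta: Fraction):
    """Coefficients of Tr(tP_n pi_m P_n) sigma^2 / n in (2 - alpha) p_1 + p_2 / alpha.

    Returns (constant part, factor multiplying L_m), assuming
    alpha = 2 / (theta + 2), p_1 ~ 1 + theta/2 + L_m / (2 (theta + 1))
    and p_2 ~ L_m / (2 (theta + 1)).
    """
    alpha = Fraction(2) / (theta + 2)
    p1_const = 1 + theta / 2
    p1_weight = Fraction(1) / (2 * (theta + 1))
    p2_weight = Fraction(1) / (2 * (theta + 1))
    const = (2 - alpha) * p1_const
    weight = (2 - alpha) * p1_weight + p2_weight / alpha
    return const, weight


def closed_form_weight(theta: Fraction) -> Fraction:
    """The factor (theta^2 + 8 theta + 8) / (4 (theta + 1)(theta + 2)) stated in the text."""
    return (theta**2 + 8 * theta + 8) / (4 * (theta + 1) * (theta + 2))


if __name__ == "__main__":
    for theta in (Fraction(1, 10), Fraction(1), Fraction(7, 2), Fraction(50)):
        const, weight = penalty_coefficients(theta)
        print(theta, const == 1 + theta, weight == closed_form_weight(theta), weight < 1)
```

The factor of $L_m$ stays below 1 because $4(\theta+1)(\theta+2) - (\theta^2+8\theta+8) = 3\theta^2 + 4\theta > 0$ for every $\theta > 0$.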

7.2 Proofs of Theorems 3.1 and 3.2
7.2.1 Proof of Theorem 3.1

Given $\theta > 0$, we can find two positive numbers $\delta = \delta(\theta) < 1/2$ and $\eta = \eta(\theta)$ such that $(1+\theta)(1-2\delta) \geqslant (1+2\eta)$. Thus we define

\[
\Omega_n = \left\{ \hat\sigma^2 \geqslant (1-2\delta)\sigma^2 \right\} .
\]

On $\Omega_n$, we know that

\[
\forall m \in \mathcal{M},\quad \operatorname{pen}(m) \geqslant (1+2\eta)\frac{\operatorname{Tr}({}^t\!P_n\pi_m P_n)}{n}\sigma^2 .
\]

Taking care of the random nature of the penalty, we argue as in the proof of Theorem 2.1 with $L_m = \eta$ to get

TO. r,| ~,| 2 -„ 1 ^ V + 1 - r fn 112 .IDT f M Tr( f P n 7T m P n ) 2 ) p 2 (P n )(J 2 „„ , x s 

L J 77 meM [ ra J n ' ' 

(61) 

where i?p^ r? (-? r ) is defined by 

R'^ V {F) = C v ^ exp - 5 \, " " ■ 

meA4 V P ™ / 

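We use Lemma 8.3 and (21) to get an upperbound for $\mathbb{E}[\operatorname{pen}(m)]$,

\[
\begin{aligned}
\mathbb{E}[\operatorname{pen}(m)] &\leqslant (1+\theta)\frac{\operatorname{Tr}({}^t\!P_n\pi_m P_n)}{n}\sigma^2 + (1+\theta)\frac{\operatorname{Tr}({}^t\!P_n\pi_m P_n)\,\|s-\pi s\|_n^2}{\operatorname{Tr}({}^t\!P_n(I_n-\pi)P_n)} \\
&\leqslant (1+\theta)\frac{\operatorname{Tr}({}^t\!P_n\pi_m P_n)}{n}\sigma^2 + 2(1+\theta)\|s-\pi s\|_n^2 .
\end{aligned}
\]

Proposition 1.1 and (61) give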
E[|| S -s||^„] <C(fl) igf E - 3 m ||»] + 2(9 + l)||n - 7Ts\\l + p2( y a2 B!^{T) (62) 

where C(0) > 1. 

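We now bound $\mathbb{E}[\|s-\tilde s\|_n^2\,\mathbf{1}_{\Omega_n^c}]$. Note that

\[
\|s-\tilde s\|_n^2 = \|s-s_{\hat m}\|_n^2 + \sigma^2\|\pi_{\hat m}P_n\varepsilon\|_n^2 \leqslant \|s\|_n^2 + \sigma^2\|P_n\varepsilon\|_n^2
\]

and thus, by the Cauchy-Schwarz Inequality,

\[
\mathbb{E}\left[\|s-\tilde s\|_n^2\,\mathbf{1}_{\Omega_n^c}\right] \leqslant \|s\|_n^2\,\mathbb{P}(\Omega_n^c) + \sigma^2\,\mathbb{E}\left[\|P_n\varepsilon\|_n^2\,\mathbf{1}_{\Omega_n^c}\right] \leqslant \left( \|s\|_n^2 + \sigma^2\,\mathbb{E}\left[\|P_n\varepsilon\|_n^4\right]^{1/2} \right)\mathbb{P}(\Omega_n^c)^{1/2} .
\]

Moreover, the eigenvalues of the matrix ${}^t\!P_nP_n$ are nonnegative and so

\[
\begin{aligned}
\mathbb{E}\left[\|P_n\varepsilon\|_n^4\right]^{1/2} &= \left( \operatorname{Var}\left(\|P_n\varepsilon\|_n^2\right) + \mathbb{E}\left[\|P_n\varepsilon\|_n^2\right]^2 \right)^{1/2} \\
&\leqslant \frac{\sqrt{\operatorname{Tr}({}^t\!P_nP_n)\left( \operatorname{Tr}({}^t\!P_nP_n) + 2\rho^2(P_n) \right)}}{n} \\
&\leqslant \frac{\operatorname{Tr}({}^t\!P_nP_n) + \left( \operatorname{Tr}({}^t\!P_nP_n) + 2\rho^2(P_n) \right)}{2n} \\
&= \frac{\operatorname{Tr}({}^t\!P_nP_n) + \rho^2(P_n)}{n} .
\end{aligned}
\]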
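Finally, the Lemma 8.4 gives

\[
\begin{aligned}
\mathbb{E}\left[\|s-\tilde s\|_n^2\,\mathbf{1}_{\Omega_n^c}\right] &\leqslant C'(\theta)\left( \|s\|_n^2 + \frac{\operatorname{Tr}({}^t\!P_nP_n) + \rho^2(P_n)}{n}\sigma^2 \right)\exp\left( -\frac{\theta^2\operatorname{Tr}({}^t\!P_nP_n)}{32\rho^2(P_n)} \right) \\
&\leqslant C'(\theta)\left( \|s\|_n^2 + \frac{\rho^2(P_n)(n+1)}{n}\sigma^2 \right)\exp\left( -\frac{\theta^2\operatorname{Tr}({}^t\!P_nP_n)}{32\rho^2(P_n)} \right) \\
&\leqslant C'(\theta)\left( \|s\|_n^2 + 2\rho^2(P_n)\sigma^2 \right)\exp\left( -\frac{\theta^2\operatorname{Tr}({}^t\!P_nP_n)}{32\rho^2(P_n)} \right)
\end{aligned} \tag{63}
\]

where $C'(\theta) > 1$. The inequality (24) follows by collecting (62) and (63).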



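7.2.2 Proof of Theorem 3.2

Given $\theta > 0$, we can find two positive numbers $\delta = \delta(\theta) < 1/3$ and $\eta = \eta(\theta)$ such that $(1+\theta)(1-3\delta) \geqslant (1+2\eta)$. Thus we define

\[
\Omega'_n = \left\{ \hat\sigma^2 \geqslant (1-3\delta)\sigma^2 \right\} .
\]

On $\Omega'_n$, we know that

\[
\forall m \in \mathcal{M},\quad \operatorname{pen}(m) \geqslant (1+2\eta)\frac{\operatorname{Tr}({}^t\!P_n\pi_m P_n)}{n}\sigma^2 .
\]

Let $m$ be any element of $\mathcal{M}$ that minimizes $\|s-s_{m'}\|_n^2 + \sigma^2\operatorname{Tr}({}^t\!P_n\pi_{m'}P_n)/n$ among $m' \in \mathcal{M}$. Taking care of the random nature of the penalty, we argue as in the proof of Theorem 2.2 with $L_m = \eta$ to get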



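\[
\mathbb{E}\left[\|s-\tilde s\|_n^{2q}\,\mathbf{1}_{\Omega'_n}\right]^{1/q} \leqslant C(q,\theta)\,\mathbb{E}\left[ \left( \|s-s_m\|_n^2 + \frac{\operatorname{Tr}({}^t\!P_n\pi_m P_n)}{n}\hat\sigma^2 \right)^q \right]^{1/q} + \frac{\rho^2(P_n)\sigma^2}{n}R_n(p,q,\theta)^{1/q}
\]

where $R_n(p,q,\theta)$ is equal to

\[
R_n(p,q,\theta) = C'(p,q,\theta)\tau_p\left[ N + \sum_{m\in\mathcal{M}:\,S_m\neq\{0\}} \left( 1 + \frac{\operatorname{Tr}({}^t\!P_n\pi_m P_n)}{\rho({}^t\!P_n\pi_m P_n)} \right)\left( \frac{\eta\operatorname{Tr}({}^t\!P_n\pi_m P_n)}{\rho^2(P_n)} \right)^{q-p/2} \right] .
\]

Since $q \leqslant 1$, by a convexity argument and Jensen's inequality we deduce

\[
\mathbb{E}\left[\|s-\tilde s\|_n^{2q}\,\mathbf{1}_{\Omega'_n}\right]^{1/q} \leqslant C(q,\theta)\left( \|s-s_m\|_n^2 + \frac{\operatorname{Tr}({}^t\!P_n\pi_m P_n)}{n}\mathbb{E}\left[\hat\sigma^2\right] \right) + \frac{\rho^2(P_n)\sigma^2}{n}R_n(p,q,\theta)^{1/q} . \tag{64}
\]

Lemma 8.3 and (21) give

\[
\begin{aligned}
\frac{\operatorname{Tr}({}^t\!P_n\pi_m P_n)}{n}\mathbb{E}\left[\hat\sigma^2\right] &= \frac{\operatorname{Tr}({}^t\!P_n\pi_m P_n)}{n}\sigma^2 + \frac{n\operatorname{Tr}({}^t\!P_n\pi_m P_n)\,\|s-\pi s\|_n^2}{n\operatorname{Tr}({}^t\!P_n(I_n-\pi)P_n)} \\
&\leqslant \frac{\operatorname{Tr}({}^t\!P_n\pi_m P_n)}{n}\sigma^2 + 2\|s-\pi s\|_n^2 .
\end{aligned}
\]

Thus, by the definition of $m$ and Proposition 1.1, (64) becomes

\[
\mathbb{E}\left[\|s-\tilde s\|_n^{2q}\,\mathbf{1}_{\Omega'_n}\right]^{1/q} \leqslant C(q,\theta)\left( \inf_{m\in\mathcal{M}}\mathbb{E}\left[\|s-\hat s_m\|_n^2\right] + 2\|s-\pi s\|_n^2 \right) + \frac{\rho^2(P_n)\sigma^2}{n}R_n(p,q,\theta)^{1/q} . \tag{65}
\]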
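We now bound $\mathbb{E}[\|s-\tilde s\|_n^{2q}\,\mathbf{1}_{\Omega_n'^c}]$. Note that

\[
\|s-\tilde s\|_n^2 = \|s-s_{\hat m}\|_n^2 + \sigma^2\|\pi_{\hat m}P_n\varepsilon\|_n^2 \leqslant \|s\|_n^2 + \sigma^2\|P_n\varepsilon\|_n^2 .
\]

Since $q \leqslant 1$, we have

\[
\mathbb{E}\left[\|s-\tilde s\|_n^{2q}\,\mathbf{1}_{\Omega_n'^c}\right] \leqslant \|s\|_n^{2q}\,\mathbb{P}(\Omega_n'^c) + \sigma^{2q}\,\mathbb{E}\left[\|P_n\varepsilon\|_n^{2q}\,\mathbf{1}_{\Omega_n'^c}\right] .
\]

Hölder's Inequality with exponent $p/2q > 1$ gives

\[
\mathbb{E}\left[\|P_n\varepsilon\|_n^{2q}\,\mathbf{1}_{\Omega_n'^c}\right] \leqslant \mathbb{E}\left[\|P_n\varepsilon\|_n^p\right]^{2q/p}\,\mathbb{P}(\Omega_n'^c)^{1-2q/p}
\]

and, since

\[
\mathbb{E}\left[\|P_n\varepsilon\|_n^p\right]^{2q/p} \leqslant \rho^{2q}(P_n)\,\mathbb{E}\left[\|\varepsilon\|_n^p\right]^{2q/p} \leqslant \rho^{2q}(P_n)\tau_p^{2q/p} ,
\]

we obtain by using Lemma 8.5 that

\[
\begin{aligned}
\mathbb{E}\left[\|s-\tilde s\|_n^{2q}\,\mathbf{1}_{\Omega_n'^c}\right] &\leqslant \left( \|s\|_n^{2q} + \sigma^{2q}\rho^{2q}(P_n)\tau_p^{2q/p} \right)\mathbb{P}(\Omega_n'^c)^{1-2q/p} \\
&\leqslant C(p,q,\theta)\kappa_n(p,q,\theta)\left( \|s\|_n^{2q} + \sigma^{2q}\rho^{2q}(P_n)\tau_p^{2q/p} \right)\left( \tau_p\rho^{2\alpha_p}(P_n)\operatorname{Tr}({}^t\!P_nP_n)^{-\beta_p} \right)^{1-2q/p}
\end{aligned}
\]

where

\[
\alpha_p = (p/2-1)\vee 1 \quad\text{and}\quad \beta_p = (p/2-1)\wedge 1 .
\]

Thus, we get

\[
\mathbb{E}\left[\|s-\tilde s\|_n^{2q}\,\mathbf{1}_{\Omega_n'^c}\right]^{1/q} \leqslant C'(p,q,\theta)\kappa'_n(p,q,\theta)\left( \|s\|_n^2 + \tau_p^{2/p}\rho^2(P_n)\sigma^2 \right)\left( \frac{\tau_p\rho^{2\alpha_p}(P_n)}{\operatorname{Tr}({}^t\!P_nP_n)^{\beta_p}} \right)^{1/q-2/p} . \tag{66}
\]

The announced result follows from (65) and (66).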



7.3 Proofs of Corollaries and Propositions 
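7.3.1 Proof of Corollary 2.1

Let us begin by applying Theorem 2.1 with constant weights $L_m = L$,

\[
\frac{\theta}{\theta+1}\mathbb{E}\left[\|s-\tilde s\|_n^2\right] \leqslant \inf_{m\in\mathcal{M}}\left\{ \|s-s_m\|_n^2 + (\theta+L)\frac{\operatorname{Tr}({}^t\!P_n\pi_m P_n)}{n}\sigma^2 \right\} + \frac{\rho^2(P_n)\sigma^2}{n}R_\theta . \tag{67}
\]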

We now upperbound the remainder term. Assumption (A3) and bounds on and L lead to 
p M ^ 2(1 + fl) 4 ^ / fl 2 L Tr(P w 7r m P w ) ^ 

2(1 + g) 4 ^ / c9*L \ 



2(1 + #) 4 X - v 



£ 3 

^ ^3 z^ e 
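The last bound is clearly finite and we denote it by $R = R(\theta,\omega)$. Thus, we derive from (67)

\[
\frac{\theta}{\theta+1}\mathbb{E}\left[\|s-\tilde s\|_n^2\right] \leqslant \inf_{m\in\mathcal{M}}\left\{ \|s-s_m\|_n^2 + \left( (\theta+L)\operatorname{Tr}({}^t\!P_n\pi_m P_n) + R\rho^2(P_n)(\dim(S_m)\vee 1) \right)\frac{\sigma^2}{n} \right\}
\]

and hypothesis (A3) gives

\[
\frac{\theta}{\theta+1}\mathbb{E}\left[\|s-\tilde s\|_n^2\right] \leqslant \inf_{m\in\mathcal{M}}\left\{ \|s-s_m\|_n^2 + (\theta+L+R/c)\left( \operatorname{Tr}({}^t\!P_n\pi_m P_n)\vee c\rho^2(P_n) \right)\frac{\sigma^2}{n} \right\}
\]

that concludes the proof.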



7.3.2 Proof of Corollary 2.2

Since $p > 6$, we can take $q = 1$ and apply Theorem 2.2 with constant weights $L_m = L$ to get

\[
\mathbb{E}\left[\|s-\tilde s\|_n^2\right] \leqslant C\inf_{m\in\mathcal{M}}\left\{ \|s-s_m\|_n^2 + (1+\theta+L)\frac{\operatorname{Tr}({}^t\!P_n\pi_m P_n)}{n}\sigma^2 \right\} + \frac{\rho^2(P_n)\sigma^2}{n}R_n(p,1,\theta) . \tag{68}
\]

To upperbound the remainder term, we use Assumption (A3) and bounds on $N_d$ and $L$ to get

Tr(P n ^ m P n )\ /LTr(*P n7 r m P n ^ ^ p/2 



Rn(p,l,0) ^ C't p 



^ C't„ 



^ C"r n 



1+ E 



m£M:S m ^{0} 



1 + 



p^PnTTmPn) J \ p 2 (P, 



1+ 2 (l + dim(5 m ))(Lcdim(5 m )) 1 - p / 2 
(cw)' 1 ^/ 2 



1 + 



■p/2 



d>0 



1 + (c^i-p/a ^(1 + d)P /2-2-. d l-p/2 
d>0 



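The last bound is clearly finite and we denote it by $R\tau_p = R(\theta,p,\omega,\omega',c)\tau_p$. Thus, as we did in the previous proof, we derive from (68) and (A3)

\[
\frac{1}{C}\mathbb{E}\left[\|s-\tilde s\|_n^2\right] \leqslant \inf_{m\in\mathcal{M}}\left\{ \|s-s_m\|_n^2 + (1+\theta+L+R\tau_p/c)\left( \operatorname{Tr}({}^t\!P_n\pi_m P_n)\vee c\rho^2(P_n) \right)\frac{\sigma^2}{n} \right\} .
\]

Since $\tau_p \geqslant 1$, the announced result follows.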



7.3.3 Proof of Proposition 4.1

The design points $(x_i, y_i^1, \dots, y_i^K)$ are all assumed to be independent realizations of a random variable in $[0,1]^{K+1}$ with distribution $\nu\otimes\nu_1\otimes\cdots\otimes\nu_K$. We denote by $I_k$ the unit $k\times k$ matrix and, for any $a = (a_1,\dots,a_k)' \in \mathbb{R}^k$, we define the usual norm

\[
|a|_2 = \left( \sum_{i=1}^k a_i^2 \right)^{1/2} .
\]

We also consider $\delta_n = \dim(F) \leqslant D_n^{(1)} + \cdots + D_n^{(K)} + 1$ and $N_n = n - D_n - \delta_n$. The quantities $\delta_n$ and $N_n$ are random and only depend on the $y_i^j$'s and not on the $x_i$'s.

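The space $E$ is generated by the vectors $e^{(i)} = (\phi_i(x_1),\dots,\phi_i(x_n))'$, for $i = 1,\dots,D_n$. Let $\{f^{(1)},\dots,f^{(\delta_n)}\}$ be an orthonormal basis of $F$ and $\{g^{(1)},\dots,g^{(N_n)}\}$ be an orthonormal basis of $G = (E+F)^\perp$. In the basis $b$ of $\mathbb{R}^n$ given by the $e^{(i)}$'s, the $f^{(i)}$'s and the $g^{(i)}$'s, the projection $P_n$ onto $E$ along $F+G$ can be expressed as

\[
M = \begin{pmatrix} I_{D_n} & 0 \\ 0 & 0 \end{pmatrix} .
\]

Considering the matrix $C$ that transforms $b$ into the canonical basis, we can decompose $P_n = CMC^{-1}$. By the properties of the norm $\rho$, we get

\[
\rho^2(P_n) \leqslant \rho^2(C)\rho^2(M)\rho^2(C^{-1}) = \rho\left( n^{-1}\,{}^tCC \right)\rho\left( n\,{}^tC^{-1}C^{-1} \right) .
\]

For any $\rho > 1$, we deduce from the previous inequality that

\[
\mathbb{P}\left( \rho(P_n) > \rho \right) \leqslant \mathbb{P}\left( \rho\left( n^{-1}\,{}^tCC \right) > \rho \right) + \mathbb{P}\left( \rho\left( n\,{}^tC^{-1}C^{-1} \right) > \rho \right) . \tag{69}
\]

Note that for any invertible matrix $A \in \mathbb{M}_n(\mathbb{R})$ and $\lambda > 1$, if $\rho(A - I_n) \leqslant 1 - \lambda^{-1}$, then $\rho(A^{-1}) \leqslant \lambda$. Thus, Inequality (69) leads to

\[
\begin{aligned}
\mathbb{P}\left( \rho(P_n) > \rho \right) &\leqslant \mathbb{P}\left( \rho\left( n^{-1}\,{}^tCC \right) > \rho \right) + \mathbb{P}\left( \rho\left( n^{-1}\,{}^tCC - I_n \right) > 1 - \rho^{-1} \right) \\
&\leqslant 2\,\mathbb{P}\left( \rho\left( n^{-1}\,{}^tCC - I_n \right) > 1 - \rho^{-1} \right) .
\end{aligned} \tag{70}
\]

Let us denote by $\Phi$ the $D_n\times D_n$ Gram matrix associated to the vectors $e^{(1)},\dots,e^{(D_n)}$. If we define the $D_n\times\delta_n$ matrix $\Omega$ by

\[
\forall 1\leqslant i\leqslant D_n,\ \forall 1\leqslant j\leqslant\delta_n,\quad \Omega_{ij} = \langle e^{(i)}, f^{(j)}\rangle_n ,
\]

then we can write the following decomposition by blocks,

\[
n^{-1}\,{}^tCC = \begin{pmatrix} \Phi & \Omega & 0 \\ {}^t\Omega & I_{\delta_n} & 0 \\ 0 & 0 & I_{N_n} \end{pmatrix} .
\]

Consequently, by the definition of $\rho(\cdot)$, we obtain

\[
\rho\left( n^{-1}\,{}^tCC - I_n \right) \leqslant \rho(\Phi - I_{D_n}) + \rho(\Omega') \tag{71}
\]

where we have set

\[
\Omega' = \begin{pmatrix} 0 & \Omega \\ {}^t\Omega & 0 \end{pmatrix} .
\]

Using (71) in (70) leads to

\[
\mathbb{P}\left( \rho(P_n) > \rho \right) \leqslant 2\,\mathbb{P}\left( \rho(\Phi - I_{D_n}) > \frac{1-\rho^{-1}}{2} \right) + 2\,\mathbb{P}\left( \rho(\Omega') > \frac{1-\rho^{-1}}{2} \right) = 2P_1 + 2P_2 . \tag{72}
\]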
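The matrix fact used to pass from (69) to (70), namely that $\rho(A - I_n) \leqslant 1 - \lambda^{-1}$ forces $\rho(A^{-1}) \leqslant \lambda$, can be illustrated numerically. The Python sketch below is restricted, as a simplifying assumption, to symmetric $2\times 2$ matrices, for which the operator norm is the largest absolute eigenvalue and has a closed form; the helper names are hypothetical.

```python
import math


def sym2_eigenvalues(a: float, b: float, c: float):
    """Eigenvalues (low, high) of the symmetric matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    return mean - radius, mean + radius


def sym2_norm(a: float, b: float, c: float) -> float:
    """Operator norm of [[a, b], [b, c]], i.e. its largest absolute eigenvalue."""
    low, high = sym2_eigenvalues(a, b, c)
    return max(abs(low), abs(high))


def neumann_check(a: float, b: float, c: float, lam: float):
    """Return (hypothesis holds, norm of the inverse or None).

    The hypothesis is rho(A - I) <= 1 - 1/lam for A = [[a, b], [b, c]].
    The inverse norm is only computed when the hypothesis holds, in which
    case both eigenvalues of A are at least 1/lam and hence nonzero.
    """
    holds = sym2_norm(a - 1.0, b, c - 1.0) <= 1.0 - 1.0 / lam
    if not holds:
        return False, None
    low, high = sym2_eigenvalues(a, b, c)
    return True, max(abs(1.0 / low), abs(1.0 / high))


if __name__ == "__main__":
    holds, inv_norm = neumann_check(1.2, 0.1, 0.9, lam=2.0)
    print(holds, inv_norm)
```

In the symmetric case the claim is immediate: the eigenvalues of $A$ lie in $[\lambda^{-1}, 2-\lambda^{-1}]$, so those of $A^{-1}$ are at most $\lambda$. The general case stated in the text follows from a Neumann series argument.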
First, we upperbound $P_1$. Let $x > 0$, we consider the event



E x = | VI < i,j < D n , <e« e^) n - jf 

Because $\Phi - I_{D_n}$ is symmetric, we know that, on the event $E_x$,
p(®-IdJ = sup - I Dn )a\ 

aeM D «, |a| 2 ^l 



< ViMV2x + BMx} . 



sup 

aeM D ™, |a] 2 ^l 



i=i j=i ^ ^° 



4>i{u)4>j{u)i>(du) 



sup y^y^ |ajaj| (|V^(0)|\/2~r + |j%j(</>)|a: 
«6lR D ", |o|a<l j=i j=i 



< y /2xL ( / ) + xL^ 
Thus, for any x > such that 



1-p 



-i 



(73) 



we deduce 



(e®,e®) n - / ^(«)^-(«)i/(d«) 



(e»,e^) n - / &(u)^(u)i/(du) 



> V ij (<P)V2x + B ij (0)x 
>V ij {4>)^ + B ij (4>)x) . (74) 



Pi < P|3(i,j) : 

< EE' 

1=1 j=l 

The choice x = (1 — p~ 1 ) 2 /(12L(0)) satisfies (|73p and we apply Bernstein Inequality (see 
Lemma 8 of |8j) to the terms of the sum in (|74|) to obtain 

nil-p- 1 ) 2 ' 



Pi < 2D z n exp 



12Lw 



(75) 



It remains to upperbound the probability P2. Let x > 0, we consider the event 
E' x = {Vl < i < £> n , VI < j < <5 n , |(e«,/^) n | < + • 
By definition of the norm $\rho(\cdot)$, we know that, on the event $E'_x$,
p(Q,') = 2 sup I *aSl&| 



|a| 2 +]6| 2 <l 



^ 2 sup 

|a| 2 <l, |6| 2 $1 



fn <?n 

i=i i=i 



5n 

< 2 sup ^J>^-| (e« /<*>>, 

|a| 2 sSl, |&| 2 



< 2 V / D n 5 n [y/2x + b^y/nx) . 
Thus, for any $x > 0$ such that



2y/D n 5 n [V2x + bsy/nx) < 



l-p- 1 



(76) 



we apply Bernstein Inequality conditionally to the yj's to deduce 



P v (p(fi')>L^] < P„(3(/-..y) : 



(e (i) ,/ (i) )n| > V2^ + W^c) 

< £X>y( (e W ,/ (i) )n >\/2i+W^) 
i=l j=i 



< 2D n 5 n e~ nx ^2D n D'e- nx 



(77) 



where P^ is the conditional probability given the y('s. Indeed, under F y and (|30p . the variables 
(e®, f^) n are centered with unit variance. The choice 



1\2 



1 6 max { 4D n 5 n , b^y/nDnSn} 



satisfies ([76"]) and (177]) leads to 



Pj, p(fi') > 



E 



1-p 



-i 



exp 



n(l — /o 



-1\2 



^ 2D n D' n exp 



16 max {4D n 5 n , b^nD n 5 n } 
n(l-p~ 1 ) 2 > 



16 max {4D n D' n , b^y^D^} 
The announced result follows from (72), (75) and (78).



(78) 



7.3.4 Proof of Proposition 5.1

The collection $\mathcal{F}^{BM}$ is nested and, for any $d \in \mathbb{N}$, the quantity $N_d$ is bounded independently from $d$. Consequently, Condition (19) is satisfied in the Gaussian case and (20) is fulfilled under moment condition. In both cases, we are free to take $L = \theta = \eta/2$ and $(\mathrm{A}_1)$ is true for $\kappa = \eta$. Assumption $(\mathrm{A}_3)$ is fulfilled with $c = 1/\rho^2$ and, since $\dim(S_m) > 0$ for any $m \in \mathcal{M}$, we can apply Corollary 2.1 or 2.2 according to whether $(\mathrm{H}_{\mathrm{Gau}})$ or $(\mathrm{H}_{\mathrm{Mom}})$ holds. Moreover, we denote by $\mathbb{E}_\varepsilon$ (resp. $\mathbb{E}_d$) the expectation on $\varepsilon$ (resp. the design points). So $\mathbb{E}_{\varepsilon,d}[\cdot] = \mathbb{E}_\varepsilon[\mathbb{E}_d[\cdot]]$.

We argue in the same way as in Section [?] and we use $(\mathrm{A}_3)$ to get



E e , d [|| 5 -5||2] < C inf E d 



\S — S 



2 Tr( ^PnlTfnPf^ 2 



m\\ n 



+ 



-a 



n 



+ C'(l + p) 2 [E d [\\t-n F+G t\\ 2 n ] + -a 
1 n 



^ C inf tEMa-s. 



1*1 + ^iMp 2 a 2 ) + C '(l + p) 2 ( E d [\\t - n F+G tf n ] + V 



n 



n 
The definition of the norm $\|\cdot\|_n$ implies that, for any $f \in L^2([0,1],\nu)$,

\[
\mathbb{E}_d\left[\|f\|_n^2\right] = \int_0^1 f(x)^2\,\nu(dx) .
\]

Since $s \in \mathcal{H}_\alpha(R)$, it is easy to see that this function lies in a Besov ball. Thus, we can apply Theorem 1 of [9] and we get, for any $m \in \mathcal{M}$,

\[
\mathbb{E}_d\left[\|s-s_m\|_n^2\right] \leqslant C(\alpha,R)\dim(S_m)^{-2\alpha} .
\]

Arguing in the same way for the tj S L 2 ([0, 1], vA and, since F _L G, we obtain 



K 



E d [\\t-ir F+G t\\ 2 n } < C{K)Y,M¥ 3 -k f+g V\\1\ 

3=1 
if 

3=1 
if 



3=1 
K 



< C(K)^E d [||^-7r^|| 2 ] 

3=1 

< C(a, i?, K)D- 2a < C(a, i?, if) dim(£ m )~ 2a 



Consequently, for any m G M, we obtain 



*M [II s " 511^1 < C" ( dim(S m )- 2a + dim(5m) ' ' 



+ - 

n 



Since a > £ n , we can consider some model S m in F BM with dimension of order n 1 /( 2a+1 ) and 
derive that 



E F 



d a - s 



| 2 ] < C" ^2n- 2a /( 2a+1 ) + ±) < C Q n" 2a /( 2Q+1 ) . 



8 Lemmas 

This section is devoted to some technical results and their proofs. 

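Lemma 8.1. Let $p, q > 0$ be two real numbers such that $2q < p$. For any $\theta > 0$, the following inequality holds

\[
\int_0^\infty \frac{q z^{q-1}}{(\theta+z)^{p/2}}\,dz \leqslant C(p,q)\theta^{q-p/2}
\]

where $C(p,q) = p/(p-2q)$.

Proof. By splitting the integral around $\theta$, we get

\[
\begin{aligned}
\int_0^\infty \frac{q z^{q-1}}{(\theta+z)^{p/2}}\,dz &= \int_0^\theta \frac{q z^{q-1}}{(\theta+z)^{p/2}}\,dz + \int_\theta^\infty \frac{q z^{q-1}}{(\theta+z)^{p/2}}\,dz \\
&\leqslant \theta^{-p/2}\int_0^\theta q z^{q-1}\,dz + \int_\theta^\infty q z^{q-1-p/2}\,dz \\
&= \left( 1 + \frac{2q}{p-2q} \right)\theta^{q-p/2} .
\end{aligned}
\]

□

The next lemma is a variant of a lemma due to Laurent and Massart.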
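As a numerical illustration of Lemma 8.1, the following Python sketch approximates the integral with a midpoint rule and compares it with the bound $C(p,q)\theta^{q-p/2}$. The quadrature scheme, the changes of variable and the function names are choices made for this example only.

```python
def lemma_8_1_integral(p: float, q: float, theta: float, steps: int = 20000) -> float:
    """Midpoint-rule approximation of int_0^inf q z^(q-1) (theta + z)^(-p/2) dz.

    Two changes of variable are used: w = z^q absorbs the factor q z^(q-1),
    then w = u / (1 - u) maps [0, inf) onto [0, 1).
    """
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        u = (k + 0.5) * h
        w = u / (1.0 - u)
        z = w ** (1.0 / q)
        total += (theta + z) ** (-p / 2.0) / (1.0 - u) ** 2
    return total * h


def lemma_8_1_bound(p: float, q: float, theta: float) -> float:
    """Right-hand side C(p, q) theta^(q - p/2) with C(p, q) = p / (p - 2q)."""
    return p / (p - 2.0 * q) * theta ** (q - p / 2.0)


if __name__ == "__main__":
    for p, q, theta in ((6, 1, 0.5), (6, 1, 2.0), (8, 2, 3.0)):
        print(p, q, theta, lemma_8_1_integral(p, q, theta), lemma_8_1_bound(p, q, theta))
```

For $p = 6$, $q = 1$ and $\theta = 2$ the integral can be computed by hand, $\int_0^\infty (2+z)^{-3}dz = 1/8$, while the bound equals $3/8$.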



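Lemma 8.2. Let $A \in \mathbb{M}_n \setminus \{0\}$ and $\varepsilon = (\varepsilon_1,\dots,\varepsilon_n)'$ be a standard Gaussian vector of $\mathbb{R}^n$. For any $x > 0$, we have

\[
\mathbb{P}\left( n\|A\varepsilon\|_n^2 \geqslant \operatorname{Tr}({}^t\!AA) + 2\sqrt{\rho(A)^2\operatorname{Tr}({}^t\!AA)x} + 2\rho(A)^2x \right) \leqslant e^{-x} \tag{79}
\]

and

\[
\mathbb{P}\left( n\|A\varepsilon\|_n^2 \leqslant \operatorname{Tr}({}^t\!AA) - 2\sqrt{\rho(A)^2\operatorname{Tr}({}^t\!AA)x} \right) \leqslant e^{-x} . \tag{80}
\]

Proof. It is known that $A\varepsilon$ is a centered Gaussian vector of $\mathbb{R}^n$ of covariance matrix given by the positive symmetric matrix $A\,{}^t\!A$. Let us denote by $a_1,\dots,a_n \geqslant 0$ the eigenvalues of $A\,{}^t\!A$. Thus, the distribution of $n\|A\varepsilon\|_n^2$ is the same as the one of $\sum_{i=1}^n a_i\varepsilon_i^2$. We have

\[
\rho(A)^2 = \max_{i=1,\dots,n}|a_i| \quad\text{and}\quad \operatorname{Tr}({}^t\!AA) = \sum_{i=1}^n a_i .
\]

Because the $a_i$'s are nonnegative,

\[
\sum_{i=1}^n a_i^2 \leqslant \rho(A)^2\operatorname{Tr}({}^t\!AA)
\]

and we can apply the Lemma 1 of [22] to obtain the announced inequalities. □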
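A small Monte Carlo experiment gives a feeling for how conservative (79) and (80) are. The Python sketch below assumes, for simplicity, that $A$ is diagonal, so that $n\|A\varepsilon\|_n^2 = \sum_i a_i\varepsilon_i^2$ with $a_i$ the squared diagonal entries; the sample size, the seed and the function name are arbitrary choices for this illustration.

```python
import math
import random


def deviation_frequencies(diag, x: float, n_draws: int = 20000, seed: int = 0):
    """Empirical frequencies of the two events of Lemma 8.2 for A = diag(diag).

    Returns (upper frequency, lower frequency); both should be compared
    with the bound exp(-x).
    """
    rng = random.Random(seed)
    eig = [d * d for d in diag]      # eigenvalues of tA A for a diagonal A
    trace = sum(eig)                 # Tr(tA A)
    rho2 = max(eig)                  # rho(A)^2
    upper = trace + 2.0 * math.sqrt(rho2 * trace * x) + 2.0 * rho2 * x
    lower = trace - 2.0 * math.sqrt(rho2 * trace * x)
    n_upper = 0
    n_lower = 0
    for _ in range(n_draws):
        # n * ||A eps||_n^2 is the sum of the squared coordinates of A eps
        quad = sum(a * rng.gauss(0.0, 1.0) ** 2 for a in eig)
        n_upper += quad >= upper
        n_lower += quad <= lower
    return n_upper / n_draws, n_lower / n_draws


if __name__ == "__main__":
    up, lo = deviation_frequencies([1.0, 0.5, 2.0, 1.5], x=1.0)
    print(up, lo, math.exp(-1.0))
```

With these values the lower threshold is negative, so the event in (80) never occurs, and the upper event is observed far less often than $e^{-1}$.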
We now introduce some properties that are satisfied by the estimator $\hat\sigma^2$ defined in (22).

Lemma 8.3. In the Gaussian case or under moment condition, the estimator $\hat\sigma^2$ satisfies

\[
\mathbb{E}\left[\hat\sigma^2\right] = \sigma^2 + \frac{n\|s-\pi s\|_n^2}{\operatorname{Tr}({}^t\!P_n(I_n-\pi)P_n)} .
\]

Proof. We have the following decomposition

\[
\|Y-\pi Y\|_n^2 = \|s-\pi s\|_n^2 + \sigma^2\|(I_n-\pi)P_n\varepsilon\|_n^2 + 2\sigma\langle s-\pi s, P_n\varepsilon\rangle_n . \tag{81}
\]

The components of $\varepsilon$ are independent and centered with unit variance. Thus, taking the expectation on both sides, we obtain

\[
\mathbb{E}\left[\|Y-\pi Y\|_n^2\right] = \|s-\pi s\|_n^2 + \sigma^2\frac{\operatorname{Tr}({}^t\!P_n(I_n-\pi)P_n)}{n} .
\]

□
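Lemma 8.4. Consider the estimator $\hat\sigma^2$ defined in the Gaussian case. For any $0 < \delta < 1/2$,

\[
\mathbb{P}\left( \hat\sigma^2 \leqslant (1-2\delta)\sigma^2 \right) \leqslant C_\delta\exp\left( -\frac{\delta^2\operatorname{Tr}({}^t\!P_nP_n)}{16\rho^2(P_n)} \right)
\]

where $C_\delta > 1$ only depends on $\delta$.

Proof. Let $a \in V^\perp$ such that $\|a\|_n^2 = 1$, we set

\[
u = \begin{cases} (s-\pi s)/\|s-\pi s\|_n & \text{if } s \neq \pi s , \\ a & \text{otherwise.} \end{cases}
\]

We have

\[
2\sigma|\langle s-\pi s, P_n\varepsilon\rangle_n| = 2\sigma|\langle u, P_n\varepsilon\rangle_n| \times \|s-\pi s\|_n \leqslant \|s-\pi s\|_n^2 + \sigma^2\langle u, P_n\varepsilon\rangle_n^2
\]

and we deduce from (81)

\[
\begin{aligned}
\|Y-\pi Y\|_n^2 &\geqslant \sigma^2\|(I_n-\pi)P_n\varepsilon\|_n^2 - \sigma^2\langle u, P_n\varepsilon\rangle_n^2 \\
&= \sigma^2\left( \|P_n\varepsilon\|_n^2 - \left( \|\pi P_n\varepsilon\|_n^2 + \langle u, P_n\varepsilon\rangle_n^2 \right) \right) \\
&= \sigma^2\left( \|P_n\varepsilon\|_n^2 - \|\pi'P_n\varepsilon\|_n^2 \right)
\end{aligned} \tag{82}
\]

where $\pi'$ is the orthogonal projection onto $V \oplus \mathbb{R}u$. Consequently,

\[
\begin{aligned}
\mathbb{P}\left( \hat\sigma^2 \leqslant (1-2\delta)\sigma^2 \right) \leqslant{}& \mathbb{P}\left( n\|P_n\varepsilon\|_n^2 - n\|\pi'P_n\varepsilon\|_n^2 \leqslant (1-2\delta)\operatorname{Tr}({}^t\!P_n(I_n-\pi)P_n) \right) \\
\leqslant{}& \mathbb{P}\left( n\|P_n\varepsilon\|_n^2 - \operatorname{Tr}({}^t\!P_nP_n) \leqslant -\delta\operatorname{Tr}({}^t\!P_n(I_n-\pi)P_n) \right) \\
& + \mathbb{P}\left( n\|\pi'P_n\varepsilon\|_n^2 - \operatorname{Tr}({}^t\!P_n\pi P_n) \geqslant \delta\operatorname{Tr}({}^t\!P_n(I_n-\pi)P_n) \right) \\
={}& P_1 + P_2 .
\end{aligned} \tag{83}
\]

The Inequality (80) and (21) give us the following upperbound for $P_1$,

\[
P_1 \leqslant \exp\left( -\frac{\delta^2\operatorname{Tr}({}^t\!P_nP_n)}{16\rho^2(P_n)} \right) . \tag{84}
\]

By the properties of the norm $\rho$, we deduce that

\[
\operatorname{Tr}({}^t\!P_n\pi'P_n) = \operatorname{Tr}({}^t\!P_n\pi P_n) + \operatorname{Tr}({}^t\!P_n\pi_uP_n) \leqslant \operatorname{Tr}({}^t\!P_n\pi P_n) + \rho^2(P_n) \tag{85}
\]

where we have defined $\pi_u$ as the orthogonal projection onto $\mathbb{R}u$. We now apply (79) with $A = \pi'P_n$ to obtain, for any $x > 0$,

\[
\begin{aligned}
& \mathbb{P}\left( n\|\pi'P_n\varepsilon\|_n^2 \geqslant (1+\delta/2)\operatorname{Tr}({}^t\!P_n\pi P_n) + (1+\delta/2)\rho^2(P_n) + (2+2/\delta)x \right) \\
&\quad\leqslant \mathbb{P}\left( n\|\pi'P_n\varepsilon\|_n^2 \geqslant (1+\delta/2)\operatorname{Tr}({}^t\!P_n\pi'P_n) + (2+2/\delta)x \right) \\
&\quad\leqslant \mathbb{P}\left( n\|\pi'P_n\varepsilon\|_n^2 - \operatorname{Tr}({}^t\!P_n\pi'P_n) \geqslant 2\sqrt{\operatorname{Tr}({}^t\!P_n\pi'P_n)x} + 2x \right) \\
&\quad\leqslant \exp\left( -x/\rho^2(\pi'P_n) \right) \\
&\quad\leqslant \exp\left( -x/\rho^2(P_n) \right) .
\end{aligned} \tag{86}
\]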

Obviously, this inequality can be extended to x G R, 
P (n||vr'P n e||2 ^ (1 + 5/2)Tv( 'P n vrP n ) + (1 + eV2)p 2 (P„) + (2 + 2/<f)x) < exp ' " ° 

and we take 

X = ^^( 5Tl -(^(/n-vr)P„)-^Tr(*P n ^)-(l + 0p 2 (P n 
" / «5IV( l P n P n ) - ^Tr( P n ^P n ) - ( 1 + f) p 2 (P n 



2(,5 + l) y-\'^ n , 2 "v-«»^/ ^ , 2 
(5 /<5Tr('P„P n ) / , S\ 2 . 



> ^TI)l^4^-l 1 + 2)^ 



STt( *P„P n ) / , <J\ 2 . 



Finally, we get 

f 5(6 + 2) f 5Tr( f P n P n ) 

^ 6XP V 4(* + l)Utf + 2)p*(P B ) / + 

/ 5(5 + 2) \ / 5 2 Tr( *P n P n ) \\ 

To conclude, we use (84) and (87) in (83). □



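Lemma 8.5. Consider the estimator $\hat\sigma^2$ defined under moment condition. For any $0 < \delta < 1/3$, there exists a sequence $(\kappa_{\delta,n})_{n\in\mathbb{N}}$ of positive numbers that tends to a positive constant $\kappa_\delta$ as $\operatorname{Tr}({}^t\!P_nP_n)/\rho^2(P_n)$ tends to infinity, such that

\[
\mathbb{P}\left( \hat\sigma^2 \leqslant (1-3\delta)\sigma^2 \right) \leqslant C(p,\delta)\kappa_{\delta,n}\tau_p\,\rho^{(p-2)\vee 2}(P_n)\operatorname{Tr}({}^t\!P_nP_n)^{-((p/2-1)\wedge 1)} .
\]

Proof. We define the vector $u \in V^\perp$ and the projection matrix $\pi'$ as we did in the proof of Lemma 8.4. The lowerbound (82) does not depend on the distribution of $\varepsilon$ and gives

\[
\mathbb{P}\left( \hat\sigma^2 \leqslant (1-3\delta)\sigma^2 \right) \leqslant \mathbb{P}\left( n\|P_n\varepsilon\|_n^2 - n\|\pi'P_n\varepsilon\|_n^2 \leqslant (1-3\delta)\operatorname{Tr}({}^t\!P_n(I_n-\pi)P_n) \right) . \tag{88}
\]

Since the matrix ${}^t\!P_nP_n$ is symmetric, we have the following decomposition

\[
\begin{aligned}
n\|P_n\varepsilon\|_n^2 - \operatorname{Tr}({}^t\!P_nP_n) &= n\langle {}^t\!P_nP_n\varepsilon, \varepsilon\rangle_n - \operatorname{Tr}({}^t\!P_nP_n) \\
&= \sum_{i=1}^n\sum_{j=1}^n ({}^t\!P_nP_n)_{ij}\varepsilon_i\varepsilon_j - \operatorname{Tr}({}^t\!P_nP_n) \\
&= \sum_{i=1}^n ({}^t\!P_nP_n)_{ii}(\varepsilon_i^2-1) + 2\sum_{i=1}^n\sum_{j>i} ({}^t\!P_nP_n)_{ij}\varepsilon_i\varepsilon_j .
\end{aligned}
\]

Thus, (88) leads to

\[
\mathbb{P}\left( \hat\sigma^2 \leqslant (1-3\delta)\sigma^2 \right) \leqslant P_1 + P_2 + P_3 \tag{89}
\]

where we have set

\[
P_1 = \mathbb{P}\left( \sum_{i=1}^n ({}^t\!P_nP_n)_{ii}(\varepsilon_i^2-1) \leqslant -\delta\operatorname{Tr}({}^t\!P_n(I_n-\pi)P_n) \right) ,
\]

\[
P_2 = \mathbb{P}\left( 2\sum_{i=1}^n\sum_{j>i} ({}^t\!P_nP_n)_{ij}\varepsilon_i\varepsilon_j \leqslant -\delta\operatorname{Tr}({}^t\!P_n(I_n-\pi)P_n) \right)
\]

and

\[
P_3 = \mathbb{P}\left( n\|\pi'P_n\varepsilon\|_n^2 - \operatorname{Tr}({}^t\!P_n\pi P_n) \geqslant \delta\operatorname{Tr}({}^t\!P_n(I_n-\pi)P_n) \right) .
\]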

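Note that $P_1$ concerns a sum of independent centered random variables. By Markov's inequality and (21), we get

\[
\begin{aligned}
P_1 &\leqslant \mathbb{P}\left( \left| \sum_{i=1}^n ({}^t\!P_nP_n)_{ii}(\varepsilon_i^2-1) \right| \geqslant \delta\operatorname{Tr}({}^t\!P_n(I_n-\pi)P_n) \right) \\
&\leqslant \delta^{-p/2}\operatorname{Tr}({}^t\!P_n(I_n-\pi)P_n)^{-p/2}\,\mathbb{E}\left[ \left| \sum_{i=1}^n ({}^t\!P_nP_n)_{ii}(\varepsilon_i^2-1) \right|^{p/2} \right] \\
&\leqslant 2^{p/2}\delta^{-p/2}\operatorname{Tr}({}^t\!P_nP_n)^{-p/2}\,\mathbb{E}\left[ \left| \sum_{i=1}^n ({}^t\!P_nP_n)_{ii}(\varepsilon_i^2-1) \right|^{p/2} \right] .
\end{aligned} \tag{90}
\]

If $p \geqslant 4$ then we use the Rosenthal Inequality (see Chapter 2 of [31]) and (57) to obtain

\[
\mathbb{E}\left[ \left| \sum_{i=1}^n ({}^t\!P_nP_n)_{ii}(\varepsilon_i^2-1) \right|^{p/2} \right] \leqslant C'(p)\tau_p\left( \sum_{i=1}^n ({}^t\!P_nP_n)_{ii}^{p/2} + \left( \sum_{i=1}^n ({}^t\!P_nP_n)_{ii}^2 \right)^{p/4} \right) .
\]

Since, for any $i \in \{1,\dots,n\}$, $({}^t\!P_nP_n)_{ii} \leqslant \rho^2(P_n)$, by a convexity argument, we get

\[
\mathbb{E}\left[ \left| \sum_{i=1}^n ({}^t\!P_nP_n)_{ii}(\varepsilon_i^2-1) \right|^{p/2} \right] \leqslant 2C'(p)\tau_p\,\rho^{p/2}(P_n)\operatorname{Tr}({}^t\!P_nP_n)^{p/4} .
\]

If $2 < p < 4$, we refer to [39] for the following inequality

\[
\mathbb{E}\left[ \left| \sum_{i=1}^n ({}^t\!P_nP_n)_{ii}(\varepsilon_i^2-1) \right|^{p/2} \right] \leqslant 2\sum_{i=1}^n \mathbb{E}\left[ \left| ({}^t\!P_nP_n)_{ii}(\varepsilon_i^2-1) \right|^{p/2} \right] \leqslant C''(p)\tau_p\,\rho^{p-2}(P_n)\operatorname{Tr}({}^t\!P_nP_n) .
\]

In both cases, (90) becomes

\[
P_1 \leqslant C(p)\delta^{-p/2}\tau_p\,\rho^{2\beta}(P_n)\operatorname{Tr}({}^t\!P_nP_n)^{-\beta} \tag{91}
\]

with $\beta = (p/2-1)\wedge p/4$.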
Let us now bound $P_2$. By Chebyshev's inequality, we get

2 <5Tr(P n (/ n - 7r)P n ) 



2 ^ ' ^ ^ ( Pn,Pn)ij£i£j 
i=l j>i 



^ 6- 2 TiCP n (I n -7r)P n r 2 E 



2 ^ ^ ^""^ ( PnPn)ij£-i£j 
i=l j>i 



n n 



< 4<T 2 Tr( tPnPn)- 2 Y,^ PnPn h( tp nPn)p q ne i e j e P e q ] . 

i=l j>« p=l g>p 

Note that, by independence between the components of e, the expectation in the last sum is 
not null if and only if i = p and j = q (in this case, its value is 1). Thus, we have 



< 4,T 2 Tr( tPnPn)- 2 E( 

i=l j>i 

^ 4,T 2 Tr(P n P n r 2 Tr((P n P n ) 2 ) 

< A5- 2 p 2 (P n )Tr( t P n P n )- 1 . 



(92) 



We finally focus on $P_3$. Recalling (85), we apply Corollary 5.1 of [4] with $A = {}^t\!P_n\pi'P_n$ to obtain, for any $x > 0$,



P {n\WP n e\\ 2 n 2(1 + <5/2)Tr(*P n vrP n ) + (1 + 5/2)p 2 (P n ) + (1 + 2/S)x) 
s$ P (n||7r'P n e|| 2 2 (1 + ty2)Tr(*P n 7r'P n ) + (1 + 2/S)x) 



< P (n|K'P n £|| 2 - Tr(*P n vr'Pn) > 2 ^Tr( P n vr'P n )x + x 

^ C(p)T p Tl( 'P^'P^p^'PnY^X-P/ 2 

< C(p)r p Tr(*P n P n )/- 2 (P n )x-^ 2 . 



Thus, for any $x \in \mathbb{R}$, we define

, bM - J C(p)r p Tr(P n P n )pf- 2 (P n )x-f/ 2 A 1 if * > 
W " 1 1 ifx^O 

and t/j(x) is an upperbound for 

P (n\\ir'P n e\\ 2 n > (1 + <5/2)Tr( P n ^P n ) + (1 + S/2)p 2 (P n ) + (1 + 2/5)x) 

If we take 

Vr( P n (/ n - vr)P n ) - ^Tr( *P n vrP n ) - f 1 + i ) p 2 (P n 



5 + 2 
5 

6 + 2 



5Tr(P n P n ) - yTr('P n 7rP n ) - + |j p 2 (P n )) 



43 



then we obtain 



P 3 < C'( P ,5)t } 



Tr(*P n P n )pP- 2 (P n ) 



A 1 



p 



(5Tr(tp n P n )/4 - (1 + 5/2) p?(P n )) 



,P/2 



+ 



< C"(p,5)r } 



p 



(1-2(1 + 2/5) p 2 (P n )/Tr(*P n P n )> 



p/2 



A 1 



(93) 



+ 



To conclude, we use (91), (92) and (93) in (89). □